Since its creation in 2014, the PRIVATICS project-team has focused on privacy protection in the digital world. It includes, on one side, activities that aim at understanding the domain and its evolution, from both theoretical and practical perspectives, and, on the other side, activities that aim at designing privacy-enhancing tools and systems. The approach taken in PRIVATICS is fundamentally interdisciplinary and covers theoretical, legal, economic, sociological and ethical aspects by means of close collaborations with members of these disciplines.
Privacy is essential to protect individuals, but also to protect society, for instance to avoid the misuse of personal data to surreptitiously manipulate individuals in elections. In this context, the PRIVATICS team carries out a broad range of research activities: some of them aim at understanding the domain and its evolution, from both the theoretical and practical viewpoints, while others aim at designing privacy-enhancing tools and systems.
Examples of the PRIVATICS team research activities, always with privacy as a common denominator, include: federated machine learning; explainability of automatic decision making systems; user manipulation through dark patterns; identification and protection against web tracking; privacy leaks in IoT, smartphone applications and wireless networks; PDF document sanitization; privacy in digital health tools; digital contact tracing in the TousAntiCovid system; legal considerations in privacy; societal considerations, for instance in the context of video-surveillance systems; and theoretical foundations for privacy, for instance with formal languages for privacy policies.
The domain of privacy and personal data protection is fundamentally multifaceted, spanning scientific and technical aspects as well as legal, economic, sociological, ethical and cultural ones. Whenever it makes sense, PRIVATICS will continue to favor interdisciplinarity, through collaborations with colleagues from other disciplines.
One illustrative example is our latest work on privacy-preserving smart metering [2]. Several countries throughout the world are planning to deploy smart meters in households in the very near future. Traditional electrical meters only measure total consumption over a given period of time (i.e., one month or one year). As such, they do not provide accurate information about when the energy was consumed. Smart meters, instead, monitor and report consumption at intervals of a few minutes. They allow the utility provider to monitor consumption almost in real time and possibly adjust generation and prices according to demand. Billing customers by how much is consumed and at what time of day will probably change consumption habits and help match energy consumption with production. In the longer term, with the advent of smart appliances, it is expected that the smart grid will remotely control selected appliances to reduce demand. Although smart metering might help improve energy management, it creates many new privacy problems. Smart meters provide very accurate consumption data to electricity providers. As the interval of data collected by smart meters decreases, the ability to disaggregate low-resolution data increases. By analysing high-resolution consumption data, Non-Intrusive Appliance Load Monitoring (NALM) techniques can identify a remarkable number of electric appliances (e.g., water heaters, well pumps, furnace blowers, refrigerators, and air conditioners) using exhaustive appliance signature libraries. We developed DREAM, DiffeRentially privatE smArt Metering, a scheme that is private under the differential privacy model and therefore provides strong and provable guarantees. With our scheme, an (electricity) supplier can periodically collect data from smart meters and derive aggregated statistics while learning only limited information about the activities of individual households.
For example, a supplier cannot tell from a user's trace when they watched TV or turned on the heating.
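To illustrate the kind of guarantee targeted here, the following is a minimal, hypothetical sketch (not the actual DREAM protocol) of epsilon-differentially private aggregation of one round of meter readings, using the standard Laplace mechanism; the function names and the clipping bound are illustrative assumptions:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_total(readings, epsilon, max_reading):
    """Differentially private sum of one round of meter readings.

    Each household contributes at most max_reading kWh, so the
    sensitivity of the sum is max_reading, and Laplace noise of
    scale max_reading / epsilon gives epsilon-differential privacy.
    """
    clipped = [min(max(r, 0.0), max_reading) for r in readings]
    return sum(clipped) + laplace_noise(max_reading / epsilon)
```

The supplier only sees the noisy total, so an individual household's contribution (and hence its activity pattern) remains hidden within the noise.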
We believe that another important problem will be related to privacy issues in big data. Public datasets are used in a variety of applications spanning from genome and web usage analysis to location-based and recommendation systems. Publishing such datasets is important since they can help us analyze and understand interesting patterns. For example, mobility trajectories have become widely collected in recent years and have opened the possibility to improve our understanding of large-scale social networks by investigating how people exchange information, interact, and develop social interactions. With billions of handsets in use worldwide, the quantity of mobility data is gigantic. When aggregated, such data can help understand complex processes, such as the spread of viruses, and build better transportation systems. While the benefits provided by these datasets are indisputable, they unfortunately pose a considerable threat to individual privacy. Indeed, mobility trajectories might be used by a malicious attacker to discover sensitive information about a user, such as their habits, religion or relationships. Because privacy is so important to people, companies and researchers are reluctant to publish datasets for fear of being held responsible for potential privacy breaches. As a result, only very few of them are actually released and available. This limits our ability to analyze such data to derive information that could benefit the general public. There is now an urgent need to develop Privacy-Preserving Data Analytics (PPDA) systems that collect and transform raw data into a version that is immunized against privacy attacks but still preserves useful information for data analysis. This is one of the objectives of Privatics. There exist two classes of PPDA, according to whether the entity that is collecting and anonymizing the data is trusted or not.
In the trusted model, that we refer to as Privacy-Preserving Data Publishing (PPDP), individuals trust the publisher to which they disclose their data. In the untrusted model, that we refer to as Privacy-Preserving Data Collection (PPDC), individuals do not trust the data publisher. They may add some noise to their data to protect sensitive information from the data publisher.
Privacy-Preserving Data Publishing: In the trusted model, individuals trust the data publisher and disclose all their data to it. For example, in a medical scenario, patients give their true information to hospitals to receive proper treatment. It is then the responsibility of the data publisher to protect the privacy of the individuals' personal data. To prevent potential data leakage, datasets must be sanitized before possible release. Several schemes have recently been proposed to release private data under the Differential Privacy model [25, 56, 26, 57, 50]. However, most of these schemes release a "snapshot" of the dataset at a given period of time. This release often consists of histograms. They can, for example, show the distributions of some pathologies (such as cancer, flu, HIV or hepatitis) in a given population. For many analytics applications, "snapshots" of data are not enough, and sequential data are required. Furthermore, current work focuses on rather simple data structures, such as numerical data. Releasing more complex data, such as graphs, is often also very useful. For example, recommendation systems need the sequences of visited websites or bought items. They also need to analyse people's connection graphs to identify the best products to recommend. Network trace analytics also rely on sequences of events to detect anomalies or intrusions. Similarly, traffic analytics applications typically need the sequence of places visited by each user. In fact, it is often essential for these applications to know that user A moved from position 1 to position 2, or at least to learn the probability of a move from position 1 to position 2. Histograms would typically represent the number of users in position 1 and position 2, but would not provide the number of users that moved from position 1 to position 2.
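As a toy illustration of such a "snapshot" release (a generic sketch using the standard Laplace mechanism, not one of the cited schemes), a differentially private histogram can be obtained by adding noise of scale 1/epsilon to each bin count, since one individual changes at most one bin by at most one:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_histogram(records, bins, epsilon):
    """epsilon-DP histogram release: per-bin sensitivity is 1, so
    Laplace(1/epsilon) noise on every bin count suffices."""
    counts = {b: 0 for b in bins}
    for r in records:
        if r in counts:
            counts[r] += 1
    return {b: c + laplace_noise(1.0 / epsilon) for b, c in counts.items()}
```

Note that such a release captures marginal counts only: exactly the limitation discussed above, since it says nothing about transitions between positions.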
Due to the inherent sequentiality and high dimensionality of sequential data, one major challenge of applying current data sanitization solutions to sequential data comes from the uniqueness of sequences (i.e., very few sequences are identical). As a result, existing techniques yield poor utility. Schemes to privately release data with complex structures, such as sequential, relational and graph data, are required. This is one of the goals of Privatics. In our current work, we address this challenge by employing a variable-length n-gram model, which extracts the essential information of a sequential database in terms of a set of variable-length n-grams [15]. We then intend to extend this approach to more complex data structures.
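The extraction step of such an n-gram model can be sketched as follows (a simplified illustration of counting alone; the actual scheme of [15] additionally uses a noisy exploration tree to decide which grams to retain and release):

```python
from collections import Counter

def ngram_counts(sequences, max_n):
    """Count all n-grams of length 1..max_n over a sequence database.

    Short grams are frequent and robust to added noise; longer grams
    capture sequentiality (e.g. moves from position 1 to position 2)
    but become increasingly unique, which is exactly the
    utility/privacy tension discussed above.
    """
    counts = Counter()
    for seq in sequences:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts
```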
Privacy-Preserving Data Collection: In the untrusted model, individuals do not trust their data publisher. For example, websites commonly use third-party web analytics services, such as Google Analytics, to obtain aggregate traffic statistics such as most visited pages, visitors' countries, etc. Similarly, other applications, such as smart metering or targeted advertising, also track users in order to derive aggregated information about a particular class of users. Unfortunately, to obtain this aggregate information, services need to track users, resulting in a violation of user privacy. One of our goals is to develop Privacy-Preserving Data Collection solutions. We propose to study whether it is possible to provide efficient collection/aggregation solutions without tracking users, i.e., without getting or learning individual contributions.
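A classic building block in this untrusted setting (given here as a general illustration, not as our specific solution) is randomized response: each user locally perturbs their answer before sending it, and the aggregator inverts the known perturbation to recover accurate population statistics without learning any individual's true value:

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p, the flipped bit otherwise.

    This satisfies local differential privacy with
    epsilon = ln(p / (1 - p)): the collector cannot confidently
    attribute any reported value to any individual.
    """
    return truth if random.random() < p else not truth

def estimate_frequency(reports, p: float = 0.75) -> float:
    """Unbiased estimate of the true 'yes' fraction: invert the known
    flipping probability applied by every user."""
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```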
The activities of PRIVATICS are not directly related to environmental considerations. However, promoting privacy in a connected world sometimes leads us to promote local data processing, as opposed to massive data collection and big data (e.g., in the case of Internet of Things systems). From this point of view, we believe that our research results are aligned with environmental considerations.
Several PRIVATICS works have had major societal impact. One can cite:
Additionally, several PRIVATICS members are part of several ethical committees:
In 2021, five PRIVATICS students brilliantly defended their PhDs, on topics including federated ML, ML for digital health, web tracking, personal data leaks in PDF documents, and understandability and accountability of automated decision systems (a.k.a. algorithms).
Additionally, Nataliia Bielova and Mathieu Cunche brilliantly defended their HDRs in June 2021, respectively entitled:
"Protecting Privacy of Web Users"
and:
"Privacy issues in wireless networks - Every frame you send, they'll be watching you".
In 2021, two PRIVATICS permanent researchers joined the French Data Protection Agency (DPA), CNIL:
These two nominations are great achievements that highlight the quality of their expertise and their contributions to the domain of privacy and data protection.
The CLuster Exposure verificAtion (CLEA) protocol, designed by the PRIVATICS team, was added to the French TousAntiCovid app on June 9th, 2021, adding presence tracing to the set of services already provided, including the initial ROBERT contact tracing service deployed one year earlier, on June 2nd, 2020. See the New Results section.
In parallel, the TousAntiCovid app progressively became the most popular app in France in 2021 in terms of number of downloads, with more than 50 Million first downloads on different devices in the AppStore and PlayStore as of January 7th, 2022. By December 28th, 2021, 1 Million TAC users had been notified as being "at risk" thanks to the digital tracing features of the app.
PRIVATICS designed several new software tools and platforms in 2021.
In April 2020, the PRIVATICS Inria team (FR) and colleagues from Fraunhofer (DE) designed the CNIL-approved ROBERT privacy-preserving exposure notification protocol, used by the French StopCovid/TousAntiCovid national app and available since June 2nd, 2020. The PRIVATICS team also designed the DESIRE protocol in May 2020, as a more advanced solution.
The present software is a Proof of Concept of the DÉSIRÉ protocol for Android smartphones.
PRESERVE (Plate-foRme wEb de SEnsibilisation aux pRoblèmes de Vie privéE):
This platform aims to raise users' awareness of privacy issues. It is meant to serve as a front-end for several works of the Privatics team as well as for collaborations and actions. The first version implements tools to inspect location history. Specifically, this version implements [hal-02421828], where a user is able to inspect the private and sensitive information inferred from their own location data. This platform will be enriched with new functionalities in the future.
In March 2021, PRIVATICS proposed the CLuster Exposure verificAtion (CLÉA) protocol, meant to warn the participants of a private event (e.g., a wedding or private party) or the persons present in a commercial or public location (e.g., a bar, restaurant or sports center) of a risk, because a certain number of people who were present at the same time have been tested COVID+ 32. It is based on:
This smartphone application is used to scan a QR code, store it locally for the next 14 days, and perform periodic risk analyses, in a decentralized manner, on the smartphone. In order to enable this decentralized risk analysis, information about clusters (i.e., the location pseudonyms and timing information) needs to be disclosed. We believe this is an acceptable downside because this information is not per se sensitive health data (it does not reveal any user health information to an eavesdropper), although it can be considered personal data (it is indirectly linked to the location manager). The CLÉA version being deployed is limited to the synchronous scan of a QR code, for situations where a user scans a QR code upon entering an event or location (e.g., a restaurant). Asynchronous scans, where the QR code is for instance associated with a transportation or event ticket, are not considered.
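The decentralized risk analysis can be sketched as follows. This is a deliberately simplified, hypothetical model: the real CLÉA protocol uses periodically renewed location pseudonyms, finer timing information and a threshold-based risk evaluation rather than a boolean match:

```python
def at_risk(local_scan_history, published_clusters):
    """Local risk check performed on the smartphone.

    local_scan_history: (location_pseudonym, time_slot) pairs kept on
    the phone for 14 days after each QR-code scan.
    published_clusters: the set of flagged (location_pseudonym,
    time_slot) pairs downloaded from the server.
    The matching happens entirely on the device, so the server never
    learns which locations a healthy user visited.
    """
    return any(visit in published_clusters for visit in local_scan_history)
```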
Finally, the CLÉA protocol is also meant to be used by location employees, in order to warn them if their workplace is qualified as a cluster, or conversely to let them upload information to the server if they are themselves tested COVID+.
At the beginning of January 2022, nineteen months after its launch, the TousAntiCovid application had been downloaded more than 50 Million times (making it the most popular app in France ever) and registered (a prerequisite to benefit from contact tracing) more than 40 Million times; 1.5 Million users had uploaded their contact proximity information, and more than 1.8 Million users had received an "at risk" notification. This success makes the TousAntiCovid app a central component of the French health strategy in the fight against COVID-19, as a multi-service application at the service of French citizens.
More information: CLEA gitlab repository (specifications and reference implementation), and 8
In Machine Learning, several entities may want to collaborate in order to improve their local model accuracy. In traditional machine learning, such collaboration requires first storing all entities' data on a centralized server before training the model on it. Such data centralization can be problematic when the data are sensitive and data privacy is required. Instead of sharing the training data, Federated Learning shares the model parameters between a server, which plays the role of aggregator, and the participating entities. More specifically, at each round the server sends the global model to some participants (downstream). These participants then update the received model with their local data and send back the updated gradient vector to the server (upstream). The server then aggregates all the participants' updates to obtain the new global model. This operation is repeated until the global model converges. Although Federated Learning improves privacy, it is not perfect. In fact, sharing gradients computed by individual parties can leak information about their private training data. Several recent attacks have demonstrated that a sufficiently skilled adversary, who can capture the model updates (gradients) sent by individual parties, can infer whether a specific record or a group property is present in the dataset of a specific party. Moreover, complete training samples can also be reconstructed purely from the captured gradients. Furthermore, Federated Learning is not only vulnerable to privacy attacks; it is also vulnerable to poisoning attacks, which can drastically decrease the model accuracy. Finally, Federated Learning incurs large communication costs during upstream/downstream exchanges between the server and the parties. This can be problematic for applications based on bandwidth- and energy-constrained devices, as is the case for mobile systems, for instance.
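The server-side aggregation step described above is, in the common FedAvg variant, a sample-weighted average of the participants' updates; a minimal sketch (with parameters represented as flat lists of floats for illustration):

```python
def fedavg(updates):
    """Aggregate one round of federated learning (FedAvg-style).

    updates: list of (num_samples, parameters) pairs, one per
    participant, where parameters is a flat list of floats.
    Returns the sample-weighted average parameter vector, which
    becomes the new global model for the next round.
    """
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    return [sum(n * params[i] for n, params in updates) / total
            for i in range(dim)]
```

It is precisely these per-participant `params` vectors that the attacks mentioned above exploit, which motivates both the privacy defenses and the bandwidth-reduction schemes discussed next.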
In this work, we propose three bandwidth-efficient schemes to reduce the bandwidth costs by up to 99.9%.
Bibliography: 6
With the widespread adoption of the quantified-self movement, an increasing number of users rely on mobile applications to monitor their physical activity through their smartphones. Granting applications direct access to sensor data exposes users to privacy risks. Indeed, these motion sensor data are usually transmitted to analytics applications hosted on the cloud, which leverage machine learning models to provide users with feedback on their health. However, nothing prevents the service provider from inferring private and sensitive information about a user, such as health or demographic attributes. In this work, we present DySan, a privacy-preserving framework to sanitize motion sensor data against unwanted sensitive inferences (i.e., improving privacy) while limiting the loss of accuracy on physical activity monitoring (i.e., maintaining data utility). To ensure a good trade-off between utility and privacy, DySan leverages the framework of Generative Adversarial Networks (GAN) to sanitize the sensor data. More precisely, by learning several networks in a competitive manner, DySan is able to build models that sanitize motion data against inferences on a specified sensitive attribute (e.g., gender) while maintaining a high accuracy on activity recognition. In addition, DySan dynamically selects the sanitizing model which maximizes privacy according to the incoming data. Experiments conducted on real datasets demonstrate that DySan can drastically limit gender inference accuracy to 47% while only reducing the accuracy of activity recognition by 3%.
Bibliography: 15
Genome-Wide Association Studies (GWAS) identify the genomic variations that are statistically associated with a particular phenotype (e.g., a disease). The confidence in GWAS results increases with the number of genomes analyzed, which encourages federated computations where biocenters would periodically share the genomes they have sequenced. However, for economical and legal reasons, this collaboration will only happen if biocenters cannot learn each others' data. In addition, GWAS releases should not jeopardize the privacy of the individuals whose genomes are used. We introduce DyPS, a novel framework to conduct dynamic privacy-preserving federated GWAS. DyPS leverages a Trusted Execution Environment to secure dynamic GWAS computations. Moreover, DyPS uses a scaling mechanism to speed up the releases of GWAS results according to the evolving number of genomes used in the study, even if individuals retract their participation consent. Lastly, DyPS also tolerates up to all-but-one colluding biocenters without privacy leaks. We implemented and extensively evaluated DyPS through several scenarios involving more than 6 million simulated genomes and up to 35,000 real genomes. Our evaluation shows that DyPS updates test statistics with a reasonable additional request processing delay (11% longer) compared to an approach that would update them with minimal delay but would lead to 8% of the genomes not being protected. In addition, DyPS can result in the same amount of aggregate statistics as a static release (i.e., at the end of the study), but can produce up to 2.6 times more statistics information during earlier dynamic releases. Besides, we show that DyPS can support a larger number of genomes and SNP positions without any significant performance penalty.
Bibliography: 16
Federated Learning (FL) is a collaborative scheme to train a learning model across multiple participants without sharing data. While FL is a clear step forward towards enforcing users' privacy, different inference attacks have been developed against it. In this work, we quantify the utility and privacy trade-off of a FL scheme using private personalized layers. While this scheme has been proposed as a local adaptation to improve the accuracy of the model through local personalization, it also has the advantage of minimizing the information about the model exchanged with the server. However, the privacy of such a scheme has never been quantified. Our evaluations using a motion sensor dataset show that personalized layers speed up the convergence of the model and slightly improve the accuracy for all users compared to a standard FL scheme, while better preventing both attribute and membership inferences compared to a FL scheme using local differential privacy.
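The mechanism evaluated here can be sketched as follows: only the shared layers are uploaded to the server, while the personalized layers never leave the device. This is a simplified illustration with hypothetical layer names:

```python
def split_update(model, personal_layers):
    """Split a model update for FL with private personalized layers.

    model: dict mapping layer name -> parameters.
    Returns (shared, local): `shared` is sent to the server for
    aggregation, `local` stays on the device, reducing both
    information leakage and upstream bandwidth.
    """
    shared = {k: v for k, v in model.items() if k not in personal_layers}
    local = {k: v for k, v in model.items() if k in personal_layers}
    return shared, local
```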
Bibliography: 19
The recent development of the Internet of Things (IoT) has democratized activity monitoring. Even if the collected data can be useful for healthcare, sharing this sensitive information exposes users to privacy threats and re-identification. This work presents two approaches to anonymize motion sensor data. The first is an extension of an earlier work based on filtering in the time-frequency plane and a convolutional neural network; the second is based on handcrafted features extracted from the distribution of zeros of the time-frequency representation. The two approaches are evaluated on a public dataset to assess the accuracy of activity recognition and user re-identification. With the first approach we obtained an accuracy rate in activity recognition of 73% while limiting identity recognition to an accuracy rate of 30%, which corresponds to an activity-to-identity ratio of 2.4. With the second approach we improved the activity-to-identity ratio to 2.67 by attaining an accuracy rate in activity recognition of 80% while maintaining the re-identification rate at 30%.
Bibliography: [hal-03354723]
Explanations for algorithmic decision systems can take different forms and target different types of users with different goals. One of the main challenges in this area is therefore to devise explanation methods that can accommodate this variety of situations. A first step to address this challenge is to allow explainees to express their needs in the most convenient way, depending on their level of expertise and motivation. This work presents a solution to this problem based on a multi-layered approach allowing users to express their requests for explanations at different levels of abstraction. We illustrate the approach with the application of a proof-of-concept system called IBEX to two case studies.
Bibliography: 18
In this work, we argue that the possibility of contesting the results of Algorithmic Decision Systems (ADS) is a key requirement for ADS used to make decisions with a high impact on individuals. We discuss the limitations of explanations and motivate the need for better facilities to contest or justify the results of an ADS. While the goal of an explanation is to make it possible for a human being to understand, the goal of a justification is to convince that the decision is good or appropriate. To claim that a result is good, it is necessary (1) to refer to an independent definition of what a good result is (the norm) and (2) to provide evidence that the norm applies to the case. Based on these definitions, we present a challenge and justification framework including three types of norms, a proof-of-concept implementation of this framework and its application to a credit decision system.
Bibliography: 12
We have set up a collaboration with legal scholar Cristiana Santos (University of Utrecht, The Netherlands) to understand the gaps and inconsistencies between law and technology – in particular, an interdisciplinary collaboration on GDPR and ePrivacy compliance for consent banners and tracking technologies. Additionally, we have collaborated with Design/UX researcher Colin M. Gray (Purdue University, USA) to study how legal and technical requirements for valid consent can be implemented in the design of consent banners.
User engagement with data privacy and security through consent banners has become a ubiquitous part of interacting with internet services. While previous work has addressed consent banners from either interaction design, legal, or ethics-focused perspectives, little research addresses the connections among multiple disciplinary approaches, including tensions and opportunities that transcend disciplinary boundaries. In this paper, we draw together perspectives and commentary from HCI, design, privacy and data protection, and legal research communities, using the language and strategies of "dark patterns" to perform an interaction criticism reading of three different types of consent banners. Our analysis builds upon designer, interface, user, and social context lenses to raise tensions and synergies that arise together in complex, contingent, and conflicting ways in the act of designing consent banners. We conclude with opportunities for transdisciplinary dialogue across legal, ethical, computer science, and interactive systems scholarship to translate matters of ethical concern into public policy.
Bibliography: 17
Consent Management Platforms under the GDPR: processors and/or controllers?
Consent Management Providers (CMPs) provide consent pop-ups that are embedded in ever more websites over time to enable streamlined compliance with the legal requirements for consent mandated by the ePrivacy Directive and the General Data Protection Regulation (GDPR). They implement the standard for consent collection from the Transparency and Consent Framework (TCF) (current version v2.0) proposed by the European branch of the Interactive Advertising Bureau (IAB Europe). Although the IAB's TCF specifications characterize CMPs as data processors, CMPs' factual activities often qualify them as data controllers instead. Discerning their exact role is crucial since compliance obligations and CMPs' liability depend on their accurate characterization. We perform empirical experiments with two major CMP providers in the EU, Quantcast and OneTrust, paired with a legal analysis. We conclude that CMPs process personal data, and we identify multiple scenarios wherein CMPs are controllers.
Bibliography: 33
Searching the Web to find doctors and make appointments online is a common practice nowadays. However, simply visiting a doctor's website might disclose health-related information. As the GDPR only allows processing of health data with explicit user consent, health-related websites must ask for consent before any data processing, in particular when they embed third-party trackers. Admittedly, it is very hard for owners of such websites both to detect the complex tracking practices that exist today and to ensure legal compliance. In this work, we present ERNIE 38, a browser extension we designed to visualise six state-of-the-art cookie-based tracking techniques. Using ERNIE, we analysed 385 health-related websites that users would visit when searching for doctors in Germany, Austria, France, Belgium, and Ireland. More specifically, we explored the tracking behavior before any interaction with the consent pop-up and after rejection of cookies on websites of doctors, hospitals, and health-related online phone books. We found that at least one form of tracking occurs on 62% of the websites before interacting with the consent pop-up, and 15% of websites include tracking after rejection. Finally, we performed a detailed technical and legal analysis of three health-related websites that demonstrate impactful legal violations. This work shows that while, from a legal point of view, health-related websites are more privacy-sensitive than other kinds of websites, they face the same technical difficulties in implementing a legally compliant website. We believe ERNIE, the browser extension we developed, to be an invaluable tool for policy-makers and regulators to improve detection and visualization of the complex tracking techniques used on these websites.
Abstract:
Thanks to the exponential growth of the Internet, citizens have become more and more exposed to personal information leakage in their digital lives. This trend began 20 years ago with the development of the Internet. The advent of smartphones, our always-connected personal assistants equipped with many sensors, further reinforced this tendency. Today the craze for quantified-self wearable devices, smart home appliances and, more generally, connected devices enables the collection of personal information – sometimes very sensitive – in domains that were so far out of reach. However, little is known about the actual practices in terms of security, confidentiality, or data exchanges. The end-user as well as the regulator are therefore prisoners of a highly asymmetric system.
The IOTics project gathers four research teams working on security, privacy and digital economy, plus the CNIL, the French data protection agency. It focuses on connected devices and follows three directions: the analysis of the internal behavior of a set of connected devices in terms of personal information leakage; the analysis of the privacy policies provided (or not) by the device manufacturers; and the analysis of the underlying ecosystem. By giving transparent information about hidden behaviors and by highlighting good and bad practices, the IOTics project aims at reducing information asymmetry, giving back control to end-users and, hopefully, encouraging stakeholders to change their practices.
PRIVATICS is in charge of the CNIL-Inria collaboration. This collaboration was at the origin of the Mobilitics project and is now at the source of many discussions and collaborations on data anonymisation, risk analysis, consent and IoT privacy.
PRIVATICS and CNIL are both actively involved in the IoTics project, the follow-up of the Mobilitics project. The goal of the Mobilitics project was to study information leakage in mobile phones. The goal of IoTics is to extend this work to IoT and connected devices.
PRIVATICS is also in charge of the organization of the CNIL-Inria prize, awarded every year to an outstanding publication in the field of data privacy.
PRESERVE (Plate-foRme wEb de SEnsibilisation aux pRoblèmes de Vie privéE):
Participants: Antoine Boutet, Adrien Baud.
The goal of the PRESERVE ADT is to design a platform that raises users' awareness of privacy issues. The first version implements tools to inspect location history. Specifically, this version implements [hal-02421828], where a user is able to inspect the private and sensitive information inferred from their own location data.
Most of the PRIVATICS members' lectures are given at INSA-Lyon (Antoine Boutet and Mathieu Cunche are associate professors at INSA-Lyon), at Grenoble Alps University (Claude Castelluccia, Vincent Roca and Cédric Lauradoux), and at Université Côte d'Azur (Nataliia Bielova).
Most of the PRIVATICS members' lectures are on the foundations of computer science, security and privacy, as well as networking. The lectures are given to computer science students, but also to business school students and to law students. The Privatics members have created original content for security (ressi2019 site) and for anonymisation (ressi2020 site).
Details of lectures:
PhD defended in 2021:
On-going PhDs: