Section: New Results

Foundations of privacy and quantitative information flow

Privacy and information flow share the goal of protecting sensitive information. Comete focuses in particular on the potential leaks due to inference from data that are public, or otherwise available to the adversary. We consider the probabilistic aspects, and we use concepts and tools from information theory.

Black-box Leakage Estimation

In [16] we have considered the problem of measuring how much a system reveals about its secret inputs in the black-box setting. Black-box means that we assume no prior knowledge of the system's internals: the idea is to run the system for chosen secrets and measure its leakage from the respective outputs. Our goal was to estimate the Bayes risk, from which one can derive some of the most popular leakage measures (e.g., min-entropy, additive, and multiplicative leakage). The state-of-the-art method for estimating these leakage measures is the frequentist paradigm, which approximates the system's internals by looking at the frequencies of its inputs and outputs. Unfortunately, this does not scale to systems with large output spaces, where it would require too many input-output examples. Consequently, it also cannot be applied to systems with continuous outputs (e.g., timing side channels, network traffic). In [16] we have exploited an analogy between Machine Learning (ML) and black-box leakage estimation to show that the Bayes risk of a system can be estimated by using a class of ML methods, the universally consistent learning rules. These rules can exploit patterns in the input-output examples to improve the convergence of the estimates, while retaining formal optimality guarantees. We have focused on a subset of them, the nearest neighbor rules, and we have shown that they significantly reduce the number of black-box queries required for a precise estimation whenever nearby outputs tend to be produced by the same secret; furthermore, some of them can tackle systems with continuous outputs. We have illustrated the applicability of these techniques on both synthetic and real-world data, and we have compared them with the state-of-the-art tool, leakiEst, which is based on the frequentist approach.
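The contrast between the two estimation styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of [16] or of leakiEst: the function names, the toy system, and the fixed value of k are all hypothetical. Note also that a k-nearest-neighbor rule is universally consistent only when k grows suitably with the number of examples; a fixed k is used here for simplicity.

```python
import random
from collections import Counter, defaultdict

def frequentist_estimate(train, test):
    """Guess, for each test output, the secret most often seen with that exact output."""
    by_output = defaultdict(Counter)
    overall = Counter()
    for secret, output in train:
        by_output[output][secret] += 1
        overall[secret] += 1
    fallback = overall.most_common(1)[0][0]  # guess for outputs never seen in training
    errors = 0
    for secret, output in test:
        if output in by_output:
            guess = by_output[output].most_common(1)[0][0]
        else:
            guess = fallback
        errors += guess != secret
    return errors / len(test)  # empirical error rate, used as the Bayes risk estimate

def knn_estimate(train, test, k):
    """Guess the majority secret among the k training outputs closest to the test output."""
    errors = 0
    for secret, output in test:
        nearest = sorted(train, key=lambda ex: abs(ex[1] - output))[:k]
        votes = Counter(s for s, _ in nearest)
        errors += votes.most_common(1)[0][0] != secret
    return errors / len(test)

def toy_system(secret, rng):
    """Hypothetical black box: a noisy integer output centred on 100 * secret."""
    return round(rng.gauss(100 * secret, 60))

def sample(n, rng):
    """Query the black box n times with uniformly chosen binary secrets."""
    examples = []
    for _ in range(n):
        secret = rng.randrange(2)
        examples.append((secret, toy_system(secret, rng)))
    return examples

rng = random.Random(0)
train, test = sample(400, rng), sample(200, rng)
k = 21  # arbitrary choice for this sketch
print("frequentist estimate:", frequentist_estimate(train, test))
print("k-NN estimate:", knn_estimate(train, test, k))
```

The frequentist rule only benefits from training examples whose output matches the test output exactly, whereas the nearest neighbor rule also uses examples with nearby outputs, which is the property the paragraph above refers to.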

An Axiomatization of Information Flow Measures

Quantitative information flow aims to assess and control the leakage of sensitive information by computer systems. A key insight in this area is that no single leakage measure is appropriate in all operational scenarios; as a result, many leakage measures have been proposed, with many different properties. To clarify this complex situation, in [11] we have studied information leakage axiomatically, showing important dependencies among different axioms. We have also established a completeness result about the g-leakage family, showing that any leakage measure satisfying certain intuitively-reasonable properties can be expressed as a g-leakage.
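As a reminder of what a g-leakage computes, the following Python sketch evaluates the prior and posterior g-vulnerability of a small channel, using the standard definitions from the QIF literature. The function names, the example channel, and the choice of the identity gain function (which yields Bayes vulnerability) are assumptions made for illustration; they are not taken from [11].

```python
def prior_g_vulnerability(prior, gain):
    """max over actions w of the expected gain sum_x prior[x] * gain[w][x]."""
    return max(sum(p * g for p, g in zip(prior, row)) for row in gain)

def posterior_g_vulnerability(prior, channel, gain):
    """Sum over outputs y of the best expected gain given the joint weights prior[x] * channel[x][y]."""
    total = 0.0
    for y in range(len(channel[0])):
        joint = [prior[x] * channel[x][y] for x in range(len(prior))]
        total += max(sum(j * g for j, g in zip(joint, row)) for row in gain)
    return total

# Hypothetical example: two secrets, uniform prior, a noisy binary channel.
prior = [0.5, 0.5]
channel = [[0.75, 0.25],   # row x: distribution of outputs for secret x
           [0.25, 0.75]]
identity_gain = [[1, 0],   # gain[w][x] = 1 if guess w equals secret x
                 [0, 1]]

v_prior = prior_g_vulnerability(prior, identity_gain)
v_post = posterior_g_vulnerability(prior, channel, identity_gain)
print("multiplicative g-leakage:", v_post / v_prior)
print("additive g-leakage:", v_post - v_prior)
```

Changing the `identity_gain` matrix changes the operational scenario being modeled (for instance, rewarding partial guesses), which is how a single family can cover many different leakage measures.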

Comparing systems: max-case refinement orders and application to differential privacy

Quantitative Information Flow (QIF) and Differential Privacy (DP) are both concerned with the protection of sensitive information, but they are rather different approaches. In particular, QIF considers the expected probability of a successful attack, while DP (in both its standard and local versions) is a max-case measure, in the sense that it is compromised by the existence of a possible attack, regardless of its probability. Comparing systems is a fundamental task in these areas: one wishes to guarantee that replacing a system A by a system B is a safe operation, that is, that the privacy of B is no worse than that of A. In QIF, a refinement order provides strong guarantees of this kind, while in DP mechanisms are typically compared (with respect to privacy) based on the ε privacy parameter that they provide.

In [15] we have explored a variety of refinement orders, inspired by that of QIF, providing precise guarantees for max-case leakage. We have studied simple structural ways of characterizing them, the relations between them, efficient methods for verifying them, and their lattice properties. Moreover, we have applied these orders to the task of comparing DP mechanisms, raising the question of whether the order based on ε provides strong privacy guarantees. We have shown that, while this is often the case for mechanisms of the same “family” (geometric, randomised response, etc.), it rarely holds across different families.
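To make the ε-based comparison concrete, the sketch below reads the ε of a mechanism off its channel matrix as the worst-case log-ratio between the probabilities that two secrets assign to the same output. It assumes the local setting, where every pair of rows is compared; the function names are hypothetical, and the code does not implement the refinement orders of [15].

```python
import math

def local_dp_epsilon(channel):
    """Smallest eps such that channel[x][y] <= exp(eps) * channel[x2][y] for all x, x2, y."""
    eps = 0.0
    for y in range(len(channel[0])):
        column = [row[y] for row in channel]
        largest, smallest = max(column), min(column)
        if largest == 0:
            continue  # this output is never produced, so it imposes no constraint
        if smallest == 0:
            return math.inf  # some secret can be excluded with certainty
        eps = max(eps, math.log(largest / smallest))
    return eps

def randomized_response(k, eps):
    """k-ary randomized response: report the true value with the highest probability."""
    keep = math.exp(eps) / (math.exp(eps) + k - 1)
    other = 1.0 / (math.exp(eps) + k - 1)
    return [[keep if x == y else other for y in range(k)] for x in range(k)]

print(local_dp_epsilon(randomized_response(3, 1.0)))
```

Two mechanisms can be ranked by the value this function returns; the question raised above is whether such a ranking also implies that one mechanism is safer than the other in the stronger, refinement sense.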

A Logical Characterization of Differential Privacy

Differential privacy (DP) is a formal definition of privacy ensuring that sensitive information relative to individuals cannot be inferred by querying a database. In [12], we have exploited a modeling of this framework via labeled Markov Chains (LMCs) to provide a logical characterization of differential privacy: we have considered a probabilistic variant of the Hennessy-Milner logic and we have defined a syntactic distance on its formulae measuring their syntactic disparities. Then, we have defined a trace distance on LMCs in terms of the syntactic distance between the sets of formulae satisfied by them. We have proved that this distance corresponds to the level of privacy of the LMCs. Moreover, we have used the distance on formulae to define a real-valued semantics for them, from which we have obtained a logical characterization of weak anonymity: the level of anonymity is measured in terms of the smallest formula distinguishing the considered LMCs. Then, we have focused on bisimulation semantics on nondeterministic probabilistic processes and we have provided a logical characterization of generalized bisimulation metrics, namely those defined via the generalized Kantorovich lifting. Our characterization is based on the notion of mimicking formula of a process and on the syntactic distance on formulae, where the former captures the observable behavior of the corresponding process and allows us to characterize bisimilarity. We have shown that the generalized bisimulation distance on processes is equal to the syntactic distance on their mimicking formulae. Moreover, we have used the distance on mimicking formulae to obtain bounds on differential privacy.

Geo-indistinguishability vs Utility in Mobility-based Geographic Datasets

In [17] we have explored the trade-offs between privacy and utility in mobility-based geographic datasets. Our aim was to find out whether it is possible to protect the privacy of the users in a dataset while, at the same time, keeping intact the utility of the information that it contains. In particular, we have focused on geo-indistinguishability as a privacy-preserving sanitization methodology, and we have evaluated its effects on the utility of the Geolife dataset. We have tested the sanitized dataset in two real-world scenarios: (1) deploying an infrastructure of WiFi hotspots to offload the mobile traffic of users living, working, or commuting in a wide geographic area; (2) simulating the spreading of a gossip-based epidemic as the outcome of a device-to-device communication protocol. We have shown the extent to which the current geo-indistinguishability techniques trade privacy for utility in real-world applications, and we have focused on their effects at the level of the population as a whole and at the level of single individuals.
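For readers unfamiliar with how a location is sanitized under geo-indistinguishability, the following Python sketch draws a perturbed point from the planar Laplace distribution, the mechanism usually associated with this notion. The text above does not state which mechanism was used in [17], so this choice is an assumption, as are the function name and the use of planar coordinates in meters rather than latitude and longitude.

```python
import math
import random

def planar_laplace(x, y, epsilon, rng=random):
    """Return (x, y) displaced by planar Laplace noise with privacy parameter epsilon (per meter)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)          # direction: uniform on the circle
    radius = rng.gammavariate(2.0, 1.0 / epsilon)    # distance: Gamma(shape=2, scale=1/epsilon)
    return x + radius * math.cos(theta), y + radius * math.sin(theta)

# Hypothetical usage: sanitize one point with epsilon = 0.01 per meter.
rng = random.Random(42)
print(planar_laplace(1000.0, 2000.0, 0.01, rng))
```

The expected displacement is 2/ε, so a smaller ε gives stronger privacy but moves reported points further from the true ones, which is the source of the utility loss studied above.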

Utility-Preserving Privacy Mechanisms for Counting Queries

Differential privacy (DP) and local differential privacy (LDP) are frameworks to protect sensitive information in data collections. They are both based on obfuscation. In DP the noise is added to the result of queries on the dataset, whereas in LDP the noise is added directly to the individual records, before they are collected. The main advantage of LDP with respect to DP is that it does not need to assume a trusted third party. The main disadvantage is that the trade-off between privacy and utility is usually worse than in DP, and retrieving reasonably good statistics from the locally sanitized data typically requires a huge collection of them. In [25], we have focused on the problem of estimating counting queries from collections of noisy answers, and we have proposed a variant of LDP based on the addition of geometric noise. Our main result is that the geometric noise has a better statistical utility than other LDP mechanisms from the literature.
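The idea of estimating a counting query from locally noised answers can be sketched as follows. This is a minimal illustration, assuming each user holds a single bit and adds unbounded two-sided geometric noise with parameter exp(-ε); the mechanism and the estimator of [25] may differ (for example by truncating the noise), and all function names are hypothetical.

```python
import math
import random

def two_sided_geometric(epsilon, rng):
    """Integer noise k with probability proportional to exp(-epsilon * |k|)."""
    alpha = math.exp(-epsilon)

    def geometric():
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        return math.floor(math.log(u) / math.log(alpha))

    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()

def noisy_report(bit, epsilon, rng):
    """What a single user sends to the collector instead of the true bit."""
    return bit + two_sided_geometric(epsilon, rng)

def estimate_count(reports):
    """The noise has zero mean, so the plain sum is an unbiased estimate of the true count."""
    return sum(reports)

# Hypothetical usage: 10000 users, 3000 of whom hold the value 1.
rng = random.Random(0)
bits = [1] * 3000 + [0] * 7000
reports = [noisy_report(b, 1.0, rng) for b in bits]
print("estimated count:", estimate_count(reports))
```

Because each report carries its own noise, the error of the estimate grows with the number of users, which illustrates why local mechanisms need large collections to yield good statistics.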

Differential Inference Testing: A Practical Approach to Evaluate Sanitizations of Datasets

In order to protect individuals’ privacy, data have to be “well-sanitized” before being shared, i.e., any personal information has to be removed first. However, it is not always clear when data should be deemed well-sanitized. We argue that the evaluation of sanitized data should be based on whether the data allow the inference of sensitive information that is specific to an individual, instead of being centered around the concept of re-identification. In [20] we have proposed a framework to evaluate the effectiveness of different sanitization techniques on a given dataset by measuring how much an individual’s record from the sanitized dataset influences the inference of their own sensitive attribute. Our intent was not to accurately predict any sensitive attribute but rather to measure the impact of a single record on the inference of sensitive information. We have demonstrated our approach by sanitizing two real datasets under different privacy models and by evaluating and comparing each sanitized dataset in our framework.