Section: New Results

Measurement and Detection of Web Tracking

Detecting Web Trackers via Analyzing Invisible Pixels

The Web has become an essential part of our lives: billions are using Web applications on a daily basis and while doing so, are placing digital traces on millions of websites. Such traces allow advertising companies, as well as data brokers to continuously profit from collecting a vast amount of data associated to the users.

Web tracking has been extensively studied over the last decade. To detect tracking, most of the research studies and user tools rely on consumer protection lists. EasyList  [23] and EasyPrivacy  [24] (EL&EP) are the most popular publicly maintained blacklist of know advertising and tracking domains, used by the popular browser extensions AdBlock Plus  [20] and uBlockOrigin  [28]. Disconnect  [22] is another very popular list for detecting domains known for tracking, used in Disconnect browser extension  [21] and in integrated tracking protection of Firefox browser  [25]. Relying on EL&EP or Disconnect became the de facto approach to detect third-party tracking requests in privacy and measurement community. However it is well-known that these lists detect only known tracking and ad-related requests, and a tracker can easily avoid this detection by registering a new domain or changing the parameters of the request.

In this work, to detect trackers, we propose a new technique based on the analysis of invisible pixels (By “invisible pixels” we mean 1x1 pixel images or images without content.). These images are routinely used by trackers in order to send information or third-party cookies back to their servers: the simplest way to do it is to create a URL containing useful information, and to dynamically add an image HTML tag into a webpage. Since invisible pixels do not provide any useful functionality, we consider them perfect suspects for tracking.

By using an Inria cluster and setting up a distributed crawler, we have collected a dataset of invisible pixels from 829,349 webpages. By analyzing this dataset, we observed that invisible pixels are widely used: more than 83% of pages incorporate at least one invisible pixel.

Overall, we made the following key contributions:

  • We define a new classification of Web tracking behaviors based on the analysis of invisible pixels. By analyzing behavior associated to the delivery of invisible pixels, we propose a new fine-grained classification of tracking behaviors, that consists of 8 categories of tracking. To our knowledge, we are the first to analyse tracking behavior based on invisible pixels that are present on 83% of the webpages.

  • We apply our classification to a full dataset and uncover new collaborations between third-party domains. We detect new relationships between third-party domains beyond basic cookie syncing detected in the past. In particular, we discovered that first to third party cookie syncing is the most prevalent tracking behavior performed by 50,812 distinct domains. Finally, we find that 76.23% of requests responsible for tracking originate from loading other resources than invisible images. To our knowledge, we are the first to discover a highly prevalent first to third party syncing behavior detected on 51.54% of all crawled domains.

  • We show that the consumer protection lists cannot be considered as ground truth to identify trackers. We find out that the browser extensions based on EasyList and EasyPrivacy (EL&EP) and Disconnect each miss 22% of tracking requests we detect. Moreover, if we combine all the lists, 238,439 requests originated from 7,773 domains are unknown to these lists and hence still track users on 5,098 webpages even if tracking protection is installed. We also detect instances of cookie syncing in domains unknown to these lists and therefore likely unrelated to advertising. To our knowledge, we are the first to detect that EL&EP and also Disconnect lists used in majority of Web Tracking detection literature are actually missing tracking requests to 7,773 distinct domains.

This working paper [19] is currently under submission at an international conference.

A survey on Browser Fingerprinting

This year, we have conducted a survey on the research performed in the domain of browser fingerprinting, while providing an accessible entry point to newcomers in the field. We explain how this technique works and where it stems from. We analyze the related work in detail to understand the composition of modern fingerprints and see how this technique is currently used online. We systematize existing defense solutions into different categories and detail the current challenges yet to overcome.

A browser fingerprint is a set of information related to a user's device from the hardware to the operating system to the browser and its configuration. Browser fingerprinting refers to the process of collecting information through a web browser to build a fingerprint of a device. Via a script running inside a browser, a server can collect a wide variety of information from public interfaces called Application Programming Interface (API) and HTTP headers. An API is an interface that provides an entry point to specific objects and functions. While some APIs require a permission to be accessed like the microphone or the camera, most of them are freely accessible from any JavaScript script rendering the information collection trivial. Contrarily to other identification techniques like cookies that rely on a unique identifier (ID) directly stored inside the browser, browser fingerprinting is qualified as completely stateless. It does not leave any trace as it does not require the storage of information inside the browser.

The goal of this work is twofold: first, to provide an accessible entry point for newcomers by systematizing existing work, and second, to form the foundations for future research in the domain by eliciting the current challenges yet to overcome. We accomplish these goals with the following contributions:

  • A thorough survey of the research conducted in the domain of browser fingerprinting with a summary of the framework used to evaluate the uniqueness of browser fingerprints and their adoption on the web.

  • An overview of how this technique is currently used in both research and industry.

  • A taxonomy that classifies existing defense mechanisms into different categories, providing a high-level view of the benefits and drawbacks of each of these techniques.

  • A discussion about the current state of browser fingerprinting and the challenges it is currently facing on the science, technological, business, and legislative aspects.

This work has been submitted for publication at an international journal.

Measuring Uniqueness of Browser Extensions and Web Logins

Web browser is the tool people use to navigate through the Web, and privacy research community has studied various forms of browser fingerprinting. Researchers have shown that a user's browser has a number of inherent “physical” characteristics that can be used to uniquely identify her browser and hence to track it across the Web. Fingerprinting of users' devices is similar to physical biometric traits of people, where only physical characteristics are studied.

Similar to previous demonstrations of user uniqueness based on their behavior, behavioral characteristics, such as browser settings and the way people use their browsers can also help to uniquely identify Web users. For example, a user installs web browser extensions she prefers, such as AdBlock, LastPass, or Ghostery to enrich her Web experience. Also, while browsing the Web, she logs in her preferred social networks, such as Gmail, Facebook or LinkedIn. In this work, we study users' uniqueness based on their behavior and preferences on the Web: we analyze how unique are Web users based on their browser extensions and logins.

In this work, we performed the first large-scale study of user uniqueness based on browser extensions and Web logins, collected from more than 16,000 users who visited our website https://extensions.inrialpes.fr/. Our experimental website identifies installed Google Chrome extensions via Web Accessible Resources. and detects websites where the user is logged in by methods that rely on URL redirection and CSP violation reports. Our website is able to detect the presence of 13K Chrome extensions (the number of detected extensions varied monthly between 12,164 and 13,931), covering approximately 28% of all free Chrome extensions (The list of detected extensions and websites are available on our website: https://extensions.inrialpes.fr/faq.php) . We also detect whether the user is connected to one or more of 60 different websites. Our main contributions are:

  • A large scale study on how unique users are based on their browser extensions and website logins. We discovered that 54.86% of users that have installed at least one detectable extension are unique; 19.53% of users are unique among those who have logged into one or more detectable websites; and 89.23% are unique among users with at least one extension and one login. Moreover, we discover that 22.98% of users could be uniquely identified by web logins, even if they disable JavaScript.

  • We study the privacy dilemma on Adblock and privacy extensions, that is, how well these extensions protect their users against trackers and how they also contribute to uniqueness. We evaluate the statement “the more privacy extensions you install, the more unique you are” by analyzing how users' uniqueness increases with the number of privacy extensions she installs; and by evaluating the tradeoff between the privacy gain of the blocking extensions such as Ghostery  [26] and Privacy Badger  [27].

We furthermore show that browser extensions and web logins can be exploited to fingerprint and track users by only checking a limited number of extensions and web logins. We have applied an advanced fingerprinting algorithm  [30] that carefully selects a limited number of extensions and logins. For example, we show that 54.86% of users are unique based on all 16,743 detectable extensions. However, by testing 485 carefully chosen extensions we can identify more than 53.96% of users. Besides, detecting 485 extensions takes only 625ms.

Finally, we give suggestions to the end users as well as website owners and browser vendors on how to protect the users from the fingerprinting based on extensions and logins.

This paper has been published at at WPES international workshop affiliated with ACM CCS 2018 [14].