Our research targets models, algorithms, and tools for highly efficient, easy-to-use data and knowledge management. Throughout our research, performance at scale is a core concern, which we address, among other techniques, by designing algorithms for a cloud (massively parallel) setting. In addition, we explore and mine rich data via machine learning techniques. Our scientific contributions fall into four interconnected areas:
Big Data applications increasingly involve diverse data sources, such as structured or unstructured documents, data graphs, and relational databases, and it is often impractical to load (consolidate) these diverse sources into a single repository. Instead, interesting data sources need to be exploited "as they are", with the added value of the data realized especially through the ability to combine (join) data from several sources.
Systems capable of exploiting diverse Big Data in this fashion are usually termed polystores. However, a current limitation of polystores is that data remains captive in its original storage system, which may limit performance. We work to devise highly efficient storage systems for heterogeneous data across a variety of data stores.
As the world's affairs become increasingly digital, a large and varied set of data sources becomes available: structured databases, such as government-gathered data (demographics, economics, taxes, elections, ...), legal records, and stock quotes for specific companies; and unstructured or semi-structured sources, including in particular graph data, sometimes endowed with semantics (see, e.g., the Linked Open Data cloud). Modern data management applications, such as data journalism, are eager to combine in innovative ways both static and dynamic information coming from structured, semi-structured, and unstructured databases and social feeds. However, current content management tools are not suited to this task, in particular when they require a lengthy, rigid cycle of data integration and consolidation in a warehouse. Thus, we need flexible tools allowing us to interconnect various kinds of data sources and query them together.
We investigate methods for finding useful information in large datasets, to provide support for investigative journalism and beyond. For example, real-world events such as elections, public demonstrations, and disclosures of illegal or surprising activities are mirrored in new data items being created and added to the global corpus of available information. Making sense of this wealth of data through a natural language question-answering framework will facilitate the work of journalists, and it can also be extremely useful to non-technical users in general.
In the Big Data era, we are faced with an increasing gap between the fast growth of data and the limited human ability to comprehend it. Consequently, there is a growing demand for data management tools that can bridge this gap and help users retrieve high-value content from data more effectively. To respond to such information needs, we aim to build interactive data exploration as a new database service, using an "explore-by-example" approach.
Semantic graphs, including data and knowledge, are hard for users to apprehend due to the complexity of their structure and, often, to their large volume. To help tame this complexity, our research follows several avenues. First, we build compact summaries of Semantic Web (RDF) graphs suited for a first-sight interaction with the data. Second, we devise fully automated methods for exploring RDF graphs using aggregate queries which, when evaluated over a given input graph, yield interesting results (with interestingness understood in a formal, statistical sense). Third, we study the exploration of highly heterogeneous data graphs resulting from integrating structured, semi-structured, and unstructured (text) data. In this context, we develop data abstraction methods that show the structure of any dataset to a novice user, as well as techniques for searching the graph.
Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve them. Our goal is to develop a data analytics optimizer that can automatically determine a cluster configuration, with a suitable number of cores and other runtime system parameters, that best meets the task objectives. To achieve this, we also need to design a multi-objective optimizer that constructs a Pareto-optimal set of job configurations for task-specific objectives and recommends new job configurations that best meet these objectives.
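As a minimal illustration of the Pareto step at the core of such an optimizer, the sketch below filters candidate job configurations down to the non-dominated set under two objectives (latency and monetary cost). The configuration fields and the toy cost model are invented for the example and do not reflect our optimizer's actual models.

```python
# A minimal sketch of Pareto-set construction over candidate job
# configurations; the Config fields and the toy cost model below are
# invented for this example, not our optimizer's actual models.

from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    cores: int
    memory_gb: int

def dominates(a, b):
    """a dominates b if it is no worse on every objective (here both are
    minimized) and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates, objectives):
    """Keep only configurations whose objective vectors are non-dominated."""
    scored = [(c, objectives(c)) for c in candidates]
    return [c for c, obj in scored
            if not any(dominates(other, obj) for _, other in scored)]

def objectives(c):
    # Toy model: more cores reduce latency but increase monetary cost.
    latency = 100.0 / c.cores + 0.1 * c.memory_gb
    cost = 0.05 * c.cores + 0.01 * c.memory_gb
    return (latency, cost)

candidates = [Config(cores, mem) for cores in (4, 8, 16, 32) for mem in (16, 64)]
for c in pareto_front(candidates, objectives):
    print(c, objectives(c))
```

On this toy model, all low-memory configurations survive as latency/cost trade-offs; a recommender can then pick, from the front, the configuration best matching the user's stated objectives.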
Database engines are migrating to the cloud to leverage opportunities for efficient resource management by adapting to the variations and heterogeneity of workloads. Resource management in a virtualized setting, like the cloud, must be enforced in a performance-efficient manner, to avoid introducing overheads to the execution. We design elastic systems that change their configuration at runtime, with minimal cost, to continuously adapt to the workload. Changes in the design include both different resource allocations and different data layouts. We consider different workloads, including transactional, analytical, and mixed, and we study the performance implications of different configurations in order to propose a set of adaptive algorithms.
Argumentation appears when we evaluate the validity of new ideas, convince an addressee, or resolve a difference of opinion. An argument contains a statement to be validated (a proposition, also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In our work, we focus on fallacious arguments, where the evidence does not prove or disprove the claim; for example, in an "ad hominem" argument, a claim is declared false because the person making it has a character flaw. We study the impact of fallacies in online discussions and show the need for better tools for their detection. In addition, we look into detecting verifiable claims made by politicians. We have started collaborations with RadioFrance and with Wikidébats, a debate platform focused on providing quality arguments on controversial topics.
We are witnessing a massive shift in the way people consume information. In the past, people had an active role in selecting the news they read. More recently, information started to appear in people's social media feeds as a byproduct of one's social relations. We now see a new shift brought by the emergence of online advertising platforms, where third parties can pay ad platforms to show specific information to particular groups of people through paid targeted ads. AI-driven algorithms power these targeting technologies. Our goal is to study the risks of AI-driven information targeting at three levels: (1) the human level: under which conditions can targeted information influence an individual's beliefs; (2) the algorithmic level: under which conditions can AI-driven targeting algorithms exploit people's vulnerabilities; and (3) the platform level: are targeting technologies leading to biases in the quality of information that different groups of people receive and assimilate? We will then use this understanding to propose protection mechanisms for platforms, regulators, and users.
Cloud computing services are developing strongly, and more and more companies and institutions run their computations in the cloud to avoid the hassle of operating their own infrastructure. Today's cloud service providers guarantee machine availability in their Service Level Agreements (SLAs), without any guarantees on performance measures for a specific cost budget. Running analytics on big data systems requires the user not only to reserve the suitable cloud instances over which the big data system will run, but also to set many system parameters, such as the degree of parallelism and the granularity of scheduling. Choosing values for these parameters and choosing cloud instances so as to meet user objectives regarding latency, throughput, and cost is a complex task if done manually. Hence the need to transform cloud service models from availability-based to performance-based, which leads to a multi-objective optimization problem. Research carried out in the team within the ERC project "Big and Fast Data Analytics" aims to develop a novel optimization framework for providing guarantees on performance while controlling the cost of data processing in the cloud.
Modern journalism increasingly relies on content management technologies in order to represent, store, and query source data and media objects themselves. Writing news articles increasingly requires consulting several sources, interpreting their findings in context, and crossing links between related sources of information. Cedar's research results are directly applicable to this area, providing techniques and tools for rich Web content warehouse management. Within the ANR ContentCheck project, and following through the SourcesSay AI Chair, we work to devise concrete algorithms and platforms to help journalists perform their work better and/or faster. This work is carried out in collaboration with journalists from RadioFrance's Le vrai du faux team.
Political discussions revolve around ideological conflicts that often split the audience into two opposing parties. Both parties try to win the argument by bringing forward information. However, often this information is misleading, and its dissemination employs propaganda techniques. We investigate the impact of propaganda in online forums and we study a particular type of propagandist content, the fallacious argument. We show that identifying such arguments remains a difficult task, but one of high importance because of the pervasiveness of this type of discourse. We also explore trends around the diffusion and consumption of propaganda and how this can impact or be a reflection of society.
The enormous financial success of online advertising platforms is partially due to the precise targeting features they offer. Ad platforms collect large amounts of data on users and use powerful AI-driven algorithms to infer users' fine-grained interests and demographics, which they make available to advertisers for targeting users. For instance, advertisers can target groups of users as small as tens or hundreds of people and as specific as "people interested in anti-abortion movements who have a particular education level". Ad platforms also employ AI-driven targeting algorithms to predict how "relevant" ads are to particular groups of people, in order to decide to whom to deliver them. While these targeting technologies create opportunities for businesses to reach interested parties and lead to economic growth, they also open the way for interested groups to use users' data to manipulate them, by targeting messages that resonate with each user.
Our work on Big Data and AI techniques applied to data journalism and fact-checking has attracted attention beyond our community and has been disseminated in general-audience settings, for instance through I. Manolescu's participation in panels at Médias en Seine and at the Colloque Morgenstern at Inria Sophia, and through invited keynotes, e.g., at DEBS 2022 and DASFAA 2022.
Our work in the SourcesSay project (Section 8.1.1), on propaganda detection (Section 8.2.1), and on ad transparency (Section 8.5), goes towards making information sharing on the Web more transparent and more trustworthy.
Quentin Massonnat (M1 intern, Ecole Polytechnique, advised by O. Balalau and I. Manolescu) received the Prix du Centre de Recherche from Ecole Polytechnique for his M1 thesis.
The team has started a collaboration with RadioFrance, the national radio operator, developing a new tool for automatically detecting and verifying (when possible) statistical and other claims. The tool has been made available to journalists, who already use it, and it has led to several international publications 14, 13. The team is grateful for the support provided by Inria and our research center towards our collaboration with RadioFrance.
The team finds it important to thank the Inria Commission d'Evaluation for its outstanding efforts: organizing and participating in Inria hiring and promotion committees, keeping us researchers meticulously informed, and upholding the moral and intellectual values that we are collectively proud of and that define our institute.
Work carried out within the ANR AI Chair SourcesSay project has focused on developing a platform for integrating arbitrary heterogeneous data into a graph, then exploring and querying that graph in a simple, intuitive manner through keyword search 11. The main technical challenges are: (i) how to interconnect structured and semi-structured data sources; we address this through information extraction (when an entity appears in two data sources or in two places in the same graph, we create only one node, thus interlinking the two locations) and through similarity comparisons; (ii) how to find all connections between nodes matching specific search criteria or certain keywords. The latter question is particularly challenging in our context, since ConnectionLens graphs can be quite large and query answers can traverse edges in both directions.
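The sketch below illustrates, under strong simplifications, the two ideas above: creating a single node per entity value interlinks all the sources mentioning it, and keyword queries are answered by connecting matching nodes through paths that ignore edge direction. The graph content and the networkx-based implementation are purely illustrative; ConnectionLens's actual algorithms are considerably more involved.

```python
# Minimal sketch of two ConnectionLens-style ideas: (i) one node per
# extracted entity value, so sources mentioning it are interlinked, and
# (ii) keyword answers connect matching nodes through paths traversing
# edges in either direction. Graph content is invented for the example.

import networkx as nx

g = nx.DiGraph()

def add_entity(graph, label):
    """One node per entity value: re-adding an existing label links sources."""
    graph.add_node(label)   # networkx deduplicates nodes by key
    return label

doc1 = add_entity(g, "doc:contract.pdf")
doc2 = add_entity(g, "db:suppliers")
entity = add_entity(g, "entity:Acme Corp")   # appears in both sources
g.add_edge(doc1, entity, label="mentions")
g.add_edge(doc2, entity, label="row_value")

def keyword_answer(graph, kw1, kw2):
    """Connect nodes matching each keyword, ignoring edge direction."""
    undirected = graph.to_undirected(as_view=True)
    starts = [n for n in graph if kw1.lower() in n.lower()]
    ends = [n for n in graph if kw2.lower() in n.lower()]
    for s in starts:
        for e in ends:
            if nx.has_path(undirected, s, e):
                yield nx.shortest_path(undirected, s, e)

print(list(keyword_answer(g, "contract", "suppliers")))
# [['doc:contract.pdf', 'entity:Acme Corp', 'db:suppliers']]
```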
In this context, the following new contributions have been brought:
ConnectionLens is available online at: https://gitlab.inria.fr/cedar/connection-lens.
To strengthen public trust and counter disinformation, computational fact-checking leveraging digital data sources attracts interest from journalists and the computer science community. A particularly interesting class of data sources comprises statistics, that is, numerical data compiled mostly by governments, administrations, and international organizations. Statistics are often multidimensional datasets, where multiple dimensions characterize one value and the dimensions may be organized into hierarchies. To address this challenge, we developed STATCHECK, a statistic fact-checking system, in collaboration with RadioFrance. The technical novelty of STATCHECK is twofold: (i) we focus on multidimensional, complex-structure statistics, which have received little attention so far despite their practical importance; and (ii) we contribute novel statistical claim extraction modules for French, a language for which few such resources exist. We validated the efficiency and quality of our system on large statistic datasets (hundreds of millions of facts), including the complete INSEE (French) and Eurostat (European Union) datasets, as well as French presidential election debates 13, 14.
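As a toy illustration of the checking step (not STATCHECK's actual pipeline), the sketch below looks up a claimed figure in a multidimensional dataset, where a tuple of dimension values identifies one reference value, and compares the two under a tolerance; the dataset, claim, and tolerance are invented for the example.

```python
# Toy sketch of checking an extracted statistical claim against a
# multidimensional dataset: a full binding of the dimensions identifies
# one reference value, which is compared with the claimed figure under a
# tolerance. Data and structures are illustrative, not STATCHECK's.

# A statistic table keyed by its dimension values: (indicator, area, year).
stats = {
    ("unemployment_rate", "FR", 2021): 7.9,   # invented reference value
}

def check_claim(claimed_value, dims, tolerance=0.05):
    """Return (reference_value, verdict) for a fully-bound dimension tuple."""
    ref = stats.get(dims)
    if ref is None:
        return None, "no matching statistic"
    close = abs(claimed_value - ref) <= tolerance * abs(ref)
    return ref, "supported" if close else "contradicted"

# Claim: "unemployment in France was 8% in 2021"
print(check_claim(8.0, ("unemployment_rate", "FR", 2021)))
# (7.9, 'supported')  -- 8.0 is within 5% of the reference value
```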
Humans use argumentation daily to evaluate the validity of new ideas, convince an addressee, or resolve a difference of opinion. An argument contains a statement to be validated (a proposition, also called claim or conclusion), a set of backing propositions (called premises, which should be accepted ideas), and a logical connection between all the pieces of information presented that allows the inference of the conclusion. In this work, we focus on fallacies: weak arguments that seem convincing even though their evidence does not prove or disprove the argument's conclusion.
Fallacy detection is part of argumentation mining, the area of natural language processing dedicated to extracting, summarizing, and reasoning over human arguments. The task is closely related to propaganda detection, where propaganda consists of a set of manipulative techniques, such as fallacies, used in a political context to enforce an agenda 2.
In the past, we have worked on propaganda 2 and fallacy detection 10. We continue this work through a CIFRE PhD that started this year, in collaboration between the Amundi company, Inria, and Télécom Paris. The thesis aims to improve fallacy detection in natural language by leveraging both language patterns and additional information, such as common-sense knowledge, encyclopedic knowledge, and logical rules. To achieve this, we will focus on how fallacies can be represented and on how we can classify reasoning patterns in argumentation. Amundi's interest lies in applying argumentation mining to find examples of greenwashing.
Online citizen participation platforms allow large numbers of contributors to be involved in public decision-making, overcoming factors that limit their offline counterparts, such as participants' geographical location.
However, for large groups of contributors to collaborate and co-construct joint, well-elaborated proposals, we need to provide tools for users and decision-makers to navigate and understand high volumes of content.
To achieve this,
we introduce an approach based on natural language processing
to detect pairs of contradictory and equivalent proposals in online citizen participation contexts.
We apply this approach to two major national citizen consultations, namely the République Numérique and Revenu Universel d'Activité consultations. Our method leverages a Transformer-based classifier, fine-tuned on natural language inference datasets and on two weakly labeled datasets created using data from these consultations. We also address the classification problem on large texts by proposing alternative strategies explicitly designed for texts containing more than one sentence.
Finally, we highlight the great potential of our tool in the analysis, synthesis, and recommendation of contributions to citizen participation platforms. This work is currently submitted for publication.
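The sketch below illustrates the classification step with a publicly available Transformer fine-tuned for natural language inference; the checkpoint name is an example, not the model used in this work, and handling proposals longer than one sentence requires the additional strategies mentioned above.

```python
# Minimal sketch of scoring a pair of proposals with a Transformer
# fine-tuned for natural language inference; the checkpoint is a public
# example, not the model fine-tuned in this work.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/nli-deberta-v3-base"  # example NLI checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def relation(prop_a, prop_b):
    """Classify the pair: 'contradiction' flags conflicting proposals;
    mutual entailment indicates near-equivalent proposals."""
    inputs = tok(prop_a, prop_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())]

print(relation("Internet access should be a legal right.",
               "The law should not guarantee Internet access."))
```

For near-equivalence, the same scoring is applied in both directions and both must yield entailment.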
Open Information Extraction (OIE) is the task of extracting tuples of the form (subject, predicate, object), without any knowledge of the type and lexical form of the predicate, the subject, or the object. In this work, we focus on improving OIE quality by exploiting domain knowledge about the subject and object. More precisely, knowing that the subjects and objects in sentences are oftentimes named entities, we explore how to inject constraints in the extraction through constrained inference and constraint-aware training. Our work leverages the state-of-the-art OpenIE6 platform, which we adapt to our setting. Through a carefully constructed training dataset and constrained training, we obtain a
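The sketch below conveys the constrained-inference idea in a simplified form: candidate extractions whose subject or object does not align with a named entity are down-weighted. It uses spaCy NER as a stand-in for the entity information, and the candidates, scores, and penalty are invented for the example; the actual work operates inside OpenIE6's inference and training.

```python
# Simplified sketch of constrained inference for OIE: down-weight
# candidate (subject, predicate, object) extractions whose subject or
# object does not overlap a named entity. spaCy NER stands in for the
# entity information; candidates, scores, and penalty are illustrative.
# Requires: python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_spans(sentence):
    return [ent.text for ent in nlp(sentence).ents]

def rescore(candidates, sentence, penalty=0.5):
    """candidates: list of ((subj, pred, obj), score) pairs."""
    ents = entity_spans(sentence)
    out = []
    for (s, p, o), score in candidates:
        ok = any(s in e or e in s for e in ents) and \
             any(o in e or e in o for e in ents)
        out.append(((s, p, o), score if ok else score * penalty))
    return sorted(out, key=lambda x: -x[1])

sent = "Marie Curie won the Nobel Prize in 1911."
cands = [(("Marie Curie", "won", "the Nobel Prize"), 0.9),
         (("Curie won", "the", "Prize in 1911"), 0.8)]
print(rescore(cands, sent))  # the entity-aligned extraction ranks first
```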
Citations are crucial in scientific works as they help position a new publication. Each citation carries a particular intent, for example, to highlight the importance of a problem or to compare against results provided by another method. The authors' intent when making a new citation has been studied to understand the evolution of a field over time or to make recommendations for further citations. In this work, we address the task of citation intent prediction from a new perspective. In addition to textual clues present in the citation phrase, we also consider the citation graph, leveraging high-level information of citation patterns. In this novel setting, we perform a thorough experimental evaluation of graph-based models for intent prediction. We show that our model, GraphCite, improves significantly upon models that consider only the citation phrase 16.
The first topic we explored in our work is automatically generating suitable questions for existing knowledge bases. A knowledge base (KB) is represented by a set of triples, where each triple is composed of a subject, a predicate, and an object. Our question generation model aims to generate a set of answerable questions from a given KB, where each question corresponds to a subgraph of the KB. We are investigating training Transformer networks to generate questions from knowledge graphs. This approach is based on existing datasets that match questions to subgraphs (e.g., SimpleQuestions, GraphQuestions, GrailQA, etc.). The subgraphs in these datasets usually come from existing KBs such as Freebase and Wikidata. An important challenge of this approach is that the neural network usually fails to predict correct sentences under zero-shot settings, i.e., when it encounters unseen predicates or entities in the test set. This problem is especially acute in the case of zero-shot predicates. We intend to use unsupervised methods to help the neural network understand unseen predicates and entities, so as to support question generation under zero-shot settings.
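The sketch below shows the shape of this approach: a subgraph is linearized into a sequence, and a seq2seq Transformer generates the question. The generic t5-small checkpoint is used only to make the example runnable; without fine-tuning on question/subgraph pairs it will not produce good questions.

```python
# Minimal sketch of Transformer-based question generation from KB triples:
# linearize the subgraph and let a seq2seq model generate the question.
# t5-small is a generic placeholder; in practice the model is fine-tuned
# on question/subgraph pairs (SimpleQuestions, GrailQA, ...).

from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def linearize(triples):
    """Turn (subject, predicate, object) triples into a flat input string."""
    return " [SEP] ".join(f"{s} | {p} | {o}" for s, p, o in triples)

subgraph = [("Paris", "capital_of", "France")]
inputs = tok("generate question: " + linearize(subgraph), return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
# Without fine-tuning, the output is not a usable question; the sketch
# only shows the input/output shape of the approach.
```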
In parallel, we are aligning Question-Answering datasets across similar KBs. We hope that this work will help us obtain larger training sets for graph-to-text generation of questions. At its core, this problem involves aligning Freebase's entities, classes, and predicates to those of YAGO4. This is especially interesting because (i) even though YAGO has been around for quite some time, there is a dearth of QA datasets on it, and (ii) there are many QA datasets on Freebase, but Google is no longer maintaining Freebase. We have used the paraphrase model of BERT for computing predicate matchings with some success. Inspired by previous works, we have also developed a Greedy Matching algorithm for iteratively aligning the two KBs. Going forward, we will evaluate the BERT model's performance, improve the Greedy Matcher's results, and finally generate a QA dataset on YAGO4.
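The sketch below illustrates the two alignment ingredients: embedding-based predicate matching (with a sentence-transformers paraphrase model standing in for the BERT paraphrase model we use) followed by a greedy one-to-one assignment; the predicate lists are illustrative.

```python
# Minimal sketch of predicate alignment between two KBs: embed predicate
# names, then greedily pick the highest-similarity remaining pair. The
# sentence-transformers model stands in for the BERT paraphrase model;
# predicate lists are illustrative.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

freebase_preds = ["people.person.place_of_birth", "film.film.directed_by"]
yago_preds = ["birthPlace", "director", "actor"]

def greedy_align(left, right):
    """Repeatedly pick the highest-similarity unused (left, right) pair."""
    sims = util.cos_sim(model.encode(left), model.encode(right))
    pairs = sorted(((float(sims[i][j]), i, j)
                    for i in range(len(left)) for j in range(len(right))),
                   reverse=True)
    used_l, used_r, out = set(), set(), []
    for score, i, j in pairs:
        if i not in used_l and j not in used_r:
            out.append((left[i], right[j], score))
            used_l.add(i)
            used_r.add(j)
    return out

for l, r, s in greedy_align(freebase_preds, yago_preds):
    print(f"{l} <-> {r} ({s:.2f})")
```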
Graph data is generally stored and processed using two main approaches: (i) extending existing relational database management systems (RDBMSs) with graph capabilities, and (ii) using native graph database management systems (GDBMSs). The advantage of leveraging RDBMSs is to benefit from the maturity of their query optimization and execution. Conversely, native GDBMSs treat complex graph structures as first-class citizens, which may make them more efficient on complex structural queries. In this work, we consider the processing of graph-relational queries, that is, queries mixing graph and relational operators, on graph data. We take a purely relational approach, reorganizing the graph connectivity information using a novel CSR Optimised Schema (COS). Based on our storage model, incoming queries are reformulated to account for the COS data organization, and can then be optimized and executed by an RDBMS. We have implemented our approach on top of PostgreSQL, and we demonstrate that COS improves performance for many graph-relational queries of the popular Social Network Benchmark 23.
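The sketch below shows, in a simplified form, the CSR idea underlying such a storage model: an offsets array gives each vertex's slice in a flat targets array, so neighbor lookup becomes two array reads instead of a join over an edge table. The exact COS layout in our work differs; this is only the underlying data structure.

```python
# Minimal sketch of compressed sparse row (CSR) adjacency, the structure
# behind a CSR-optimised relational schema; the actual COS layout differs.
# offsets[v]:offsets[v+1] delimits vertex v's slice in `targets`.

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]  # (source, target) pairs
n = 4                                     # number of vertices

# Count out-degrees, then turn them into prefix-sum offsets.
offsets = [0] * (n + 1)
for s, _ in edges:
    offsets[s + 1] += 1
for v in range(n):
    offsets[v + 1] += offsets[v]

# Scatter each edge's target into its source's slice.
targets = [0] * len(edges)
cursor = offsets[:-1]          # slicing copies: next free slot per source
for s, t in sorted(edges):
    targets[cursor[s]] = t
    cursor[s] += 1

def neighbors(v):
    return targets[offsets[v]:offsets[v + 1]]

print(neighbors(0))  # [1, 2]
```

Stored relationally, the offsets and targets arrays become tables (or array columns) that reformulated graph queries scan directly, rather than joining over an edge table.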
Several real-time applications rely on dynamic graphs to model and store data arriving from multiple streams. In addition to the high ingestion rate, the storage and query execution challenges are amplified in contexts where consistency should be considered when storing and querying the data. This Ph.D. thesis addresses the challenges associated with multi-stream dynamic graph analytics. We propose a database design that can provide scalable storage and indexing to support consistent read-only analytical queries (present and historical), in the presence of real-time dynamic graph updates that arrive continuously from multiple streams 22.
Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting analytical users' performance goals and budgetary constraints. The RO problem is challenging because it involves a set of decisions (the partition count, placement of parallel instances on machines, and resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This project addressed the resource optimization problem for a custom-built big data processing system (MaxCompute) of the Alibaba Cloud. It supports multi-objective resource optimization via fine-grained instance-level modeling and optimization. We proposed a new architecture that breaks RO into simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level RO decisions well under a second 18.
In our increasingly digital and connected society, high-volume data streams have become more and more prevalent and complex. This has made incidents more likely, more diverse, and therefore harder for humans to anticipate and diagnose manually. In this project, we aim to assist such anticipation and diagnosis through automated solutions, focusing on deep, unsupervised, and explainable anomaly detection. We conducted multiple comparative studies on the recently proposed Exathlon benchmark, focusing on two main use cases. In the first use case, we aim to detect and explain anomalies in time series recordings from repeated executions of Spark Streaming applications. In this use case, we observed that reconstruction-based and forecasting-based detection methods could generalize well to different execution environments while exhibiting complementary behaviors. In the second use case, we target anomalies in financial transactions coming from the Swift messaging network. In both use cases, our current research focuses on designing new methods that improve detection accuracy and explanation consistency, with one direction being to leverage a small number of anomaly labels, for instance via contrastive learning approaches.
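As a minimal, self-contained illustration of reconstruction-based detection (with low-rank PCA reconstruction standing in for the deep autoencoders studied in this work), the sketch below fits a reconstruction on normal windows and flags test windows with large reconstruction error; the data and threshold rule are invented for the example.

```python
# Minimal sketch of reconstruction-based anomaly detection on windowed
# time series: fit a low-rank (PCA) reconstruction on normal training
# windows, then flag test windows with large reconstruction error. A
# stand-in for the deep autoencoders studied in this work.

import numpy as np

rng = np.random.default_rng(0)

def windows(series, size=16):
    return np.stack([series[i:i + size] for i in range(len(series) - size + 1)])

def fit_pca(train, k=3):
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]          # top-k principal directions

def recon_error(x, mean, comps):
    proj = (x - mean) @ comps.T @ comps + mean
    return np.linalg.norm(x - proj, axis=1)

normal = np.sin(np.linspace(0, 40, 600)) + 0.05 * rng.standard_normal(600)
test = normal.copy()
test[300:320] += 3.0             # injected anomaly

mean, comps = fit_pca(windows(normal))
err = recon_error(windows(test), mean, comps)
threshold = err[:200].mean() + 4 * err[:200].std()  # clean region sets it
print("anomalous windows:", np.flatnonzero(err > threshold)[:5])
```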
Despite their widespread adoption, anomaly detection (AD) systems have thus far mainly focused on detection power. However, emerging applications such as "Artificial Intelligence for IT Operations" (AIOps) point to the need for explainable anomaly detection, to enhance business operations with proactive, personalized, and dynamic insight, and to further enable corrective or preventive actions resolving IT performance issues. In our ongoing work, we address explainable AD by proposing a new explainable AD model, which achieves explainability via a set of visually informative patterns in low-dimensional, axis-aligned projections while retaining prediction accuracy. Our model, called VIPAD, builds on a classical explainable classification framework, VIPR, but addresses its fundamental limitations for anomaly detection. Our evaluation using Exathlon, the latest anomaly detection benchmark for AIOps, shows that VIPAD can approximate the accuracy of random forests, which are not explainable, while outperforming other explainable models in both prediction accuracy and quality of explanations.
Several targeted advertising platforms offer transparency mechanisms, but researchers and civil society have repeatedly shown that these have major limitations. In this work, we propose a collaborative ad transparency method to infer, without the cooperation of ad platforms, the targeting parameters used by advertisers to target their ads. Our idea is to ask users to donate data about their attributes and the ads they receive, and to use this data to infer the targeting attributes of an ad campaign. We propose a Maximum Likelihood Estimator based on a simplified Bernoulli ad delivery model. We first test our inference method through controlled ad experiments on Facebook. Then, to further investigate the potential and limitations of collaborative ad transparency, we propose a simulation framework that allows varying key parameters. We validate that our framework yields accuracies consistent with real-world observations, so that the insights from our simulations are transferable to the real world. We then perform an extensive simulation study for ad campaigns that target a combination of two attributes. Our results show that we can obtain good accuracy whenever at least ten monitored users receive an ad; this usually requires a few thousand monitored users, regardless of population size. Our simulation framework is based on a new method to generate a synthetic population with statistical properties resembling the actual population, which may be of independent interest.
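The sketch below conveys the inference idea in a simplified form: under a Bernoulli delivery model, an ad targeting attribute set A reaches each matching monitored user independently with probability p, so candidate targeting sets can be ranked by likelihood. The data and the exhaustive candidate search are invented for the example; the estimator in the paper is more elaborate.

```python
# Simplified sketch of targeting inference under a Bernoulli delivery
# model: with m monitored users matching candidate set A, of whom r
# received the ad, the likelihood is p^r * (1-p)^(m-r) with MLE p = r/m;
# an ad observed by a non-matching user rules the candidate out. The
# population and exhaustive search are invented for the example.

import math
from itertools import combinations

def log_likelihood(candidate, users, received):
    """users: dict user -> set of attributes; received: set of users."""
    matching = {u for u, attrs in users.items() if candidate <= attrs}
    if received - matching:          # ad reached a non-matching user
        return -math.inf
    m, r = len(matching), len(received)
    if m == 0 or r == 0:
        return -math.inf
    p = r / m                        # MLE of the delivery probability
    return r * math.log(p) + (m - r) * math.log(1 - p) if r < m else 0.0

users = {
    "u1": {"age:18-25", "interest:sports"},
    "u2": {"age:18-25", "interest:politics"},
    "u3": {"age:25-35", "interest:politics"},
    "u4": {"age:18-25", "interest:politics"},
}
received = {"u2", "u4"}
attrs = sorted(set().union(*users.values()))
cands = [frozenset(c) for k in (1, 2) for c in combinations(attrs, k)]
best = max(cands, key=lambda c: log_likelihood(c, users, received))
print(best)  # frozenset({'age:18-25', 'interest:politics'})
```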
Until April 2022, A. Anadiotis has been a full-time Assistant Professor at Ecole Polytechnique, where he has been in charge of two courses:
O. Balalau is a part-time (33%) assistant professor at Ecole Polytechnique, where she teaches two courses:
Yanlei Diao: University of Massachusetts Amherst, CMPSCI645, January 24 - March 11, 2022.
I. Manolescu is a part-time (50%) professor at Ecole Polytechnique, where she is in charge of the following:
I. Manolescu also teaches a course at Institut Mines Télécom:
Team members also collaborate in teaching courses at Institut Polytechnique de Paris:
PhD supervision:
Engineers supervision:
Intern supervision:
O. Balalau, Y. Diao, and I. Manolescu reviewed student applications for the Data AI master track at IPP.