Magnet is a research group that aims to design new machine-learning-based methods geared towards mining information networks. Information networks are large collections of interconnected data and documents, such as citation networks and blog networks. Our goal is to propose new prediction methods for texts and networks of texts, based on machine learning algorithms for graphs. Such algorithms include node and link classification, link prediction, clustering and probabilistic modeling of graphs. We aim to tackle real-world problems such as browsing, monitoring and recommender systems, and more broadly information extraction in information networks. Application domains cover natural language processing, social networks for cultural data and e-commerce, and biomedical informatics.

The main objective of Magnet is to develop original machine learning methods for networked data in order to build applications like browsing, monitoring and recommender systems, and more broadly information extraction in information networks. We consider information networks in which the data consist of both feature vectors and texts. We model such networks as (multiple) (hyper)graphs wherein nodes correspond to entities (documents, spans of text, users, ...) and edges correspond to relations between entities (similarity, answer, co-authoring, friendship, ...). Our main research goal is to propose new on-line and batch learning algorithms for various problems (node classification / clustering, link classification / prediction) which exploit the relationships between data entities and, more generally, the graph topology. We are also interested in searching for the best hidden graph structure to be generated for solving a given learning task. Our research will be based on generative models for graphs, on machine learning for graphs and on machine learning for texts. The challenges are the dimensionality of the input space, possibly the dimensionality of the output space, the high level of dependencies between the data, the inherent ambiguity of textual data and the limited amount of human labeling. An additional challenge will be to design scalable methods for large information networks. Hence, we will explore how sampling, randomization and active learning can be leveraged to improve the scalability of the proposed algorithms.

Our research program is organized according to the following questions:

How to go beyond vectorial classification models in Natural Language Processing (NLP) tasks?

How to adaptively build graphs with respect to the given tasks? How to create networks from observations of information diffusion processes?

How to design methods able to achieve a good trade-off between predictive accuracy and computational complexity?

How to go beyond strict node homophilic/similarity assumptions in graph-based learning methods?

One of our overall research objectives is to derive graph-based machine learning algorithms for natural language and text information extraction tasks. This section discusses the motivations behind the use of graph-based ML approaches for these tasks, the main challenges associated with it, as well as some concrete projects. Some of the challenges go beyond NLP problems and will be further developed in the next sections. An interesting aspect of the project is that we anticipate some important cross-fertilizations between NLP and ML graph-based techniques, with NLP not only benefiting from but also pushing ML graph-based approaches into new directions.

Motivations for resorting to graph-based algorithms for texts are at least threefold. First, online texts are organized in networks. With the advent of the web and the development of forums, blogs, micro-blogging, and other forms of social media, text productions have become strongly connected. Interestingly, NLP research has been rather slow in coming to terms with this situation, and most of the literature still focuses on document-based or sentence-based predictions (wherein inter-document or inter-sentence structure is not exploited). Furthermore, several multi-document tasks exist in NLP (such as multi-document summarization and cross-document coreference resolution), but most existing work typically ignores document boundaries and simply applies a document-based approach, therefore failing to take advantage of the multi-document dimension.

A second motivation comes from the fact that most (if not all) NLP
problems can be naturally conceived as graph problems. Thus, NLP tasks
often involve discovering a relational structure over a set of text
spans (words, phrases, clauses, sentences, etc.). Furthermore, the
*input* of numerous NLP tasks is also a graph; indeed, most
end-to-end NLP systems are conceived as pipelines wherein the output
of one processor is the input of the next. For instance, several
tasks take POS tagged sequences or dependency trees as input. But this
structured input is often converted to a vectorial form, which
inevitably involves a loss of information.

Finally, graph-based representations and learning methods appear to address some core problems faced by NLP: textual data are typically not independent and identically distributed, they often live on a manifold, they involve very high dimensionality, and their annotation is costly and scarce. As such, graph-based methods represent an interesting alternative to, or at least a complement of, structured prediction methods (such as CRFs or structured SVMs) commonly used within NLP. Graph-based methods, like label propagation, have also been shown to be very effective in semi-supervised settings, and have already given some positive results on a few NLP tasks.
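To make the label propagation idea concrete, here is a minimal sketch (our own illustration, not a Magnet implementation): labels from a few annotated nodes are diffused along graph edges, with the known labels clamped at each step. The graph, labels and function names here are hypothetical.

```python
import numpy as np

def label_propagation(W, y, labeled, n_iter=100):
    """Propagate labels over a graph given by affinity matrix W.

    W: (n, n) symmetric non-negative affinity matrix.
    y: (n,) labels in {0, 1}; values at unlabeled nodes are ignored.
    labeled: boolean mask of labeled nodes, clamped at every iteration.
    Returns soft scores in [0, 1]; threshold at 0.5 to obtain labels.
    """
    d = W.sum(axis=1)
    d[d == 0] = 1.0                     # avoid division by zero for isolated nodes
    P = W / d[:, None]                  # row-stochastic transition matrix
    f = y.astype(float)
    f[~labeled] = 0.5                   # neutral initialization for unlabeled nodes
    for _ in range(n_iter):
        f = P @ f                       # diffuse scores to neighbors
        f[labeled] = y[labeled]         # clamp the known labels
    return f

# Toy chain graph: only the two endpoints are labeled (0 and 1).
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
y = np.array([0, 0, 0, 0, 1])
labeled = np.array([True, False, False, False, True])
scores = label_propagation(W, y, labeled)
```

On this chain, the scores converge to the harmonic interpolation between the two clamped endpoints, which is exactly the smoothness behavior exploited by graph-based semi-supervised methods.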

Given the above motivations, our first line of research will be to investigate how one can leverage an underlying network structure (e.g., hyperlinks, user links) between documents, or text spans in general, to enhance prediction performance for several NLP tasks. We think that a “network effect”, similar to the one that took place in Information Retrieval (with the PageRank algorithm), could also positively impact NLP research. A few recent papers have already opened the way, for instance by attempting to exploit the Twitter follower graph to improve sentiment classification.

Part of the challenge here will be to investigate how adequately and efficiently one can model these problems as instances of more general graph-based problems, such as node clustering/classification or link prediction discussed in the next sections. In a few cases, like text classification or sentiment analysis, graph modeling appears to be straightforward: nodes correspond to texts (and potentially users), and edges are given by relationships like hyperlinks, co-authorship, friendship, or thread membership. Unfortunately, modeling NLP problems as networks is not always that obvious. On the one hand, the right level of representation will probably vary depending on the task at hand: the nodes may be sentences, phrases, words, etc. On the other hand, the underlying graph will typically not be given a priori, which in turn raises the question of how we construct it. A preliminary discussion of the issue of optimal graph construction for semi-supervised learning in NLP has appeared in the literature. We identify the issue of adaptive graph construction as an important scientific challenge for machine learning on graphs in general, and we will discuss it further below.

As noted above, many NLP tasks have been recast as structured
prediction problems, allowing to capture (some of the) output
dependencies.
How to best combine structured output and graph-based ML
approaches is another challenge that we intend to address. We will
initially investigate this question within a semi-supervised context,
concentrating on graph regularization and graph propagation
methods. Within such approaches, labels are typically binary or in a
small finite set. Our objective is to explore how one
propagates an exponential number of *structured labels* (like a
sequence of tags or a dependency tree) through graphs. Recent attempts
at blending structured output models with graph-based models have
been made in the literature. Another
related question that we will address in this context is how one can
learn with *partial labels* (like a partially specified tag
sequence or tree) and use the graph structure to complete the output
structure. This last question is very relevant to NLP problems where
human annotations are costly; being able to learn from partial
annotations could therefore allow for more targeted annotations and in
turn reduced costs.

The NLP tasks we will mostly focus on are coreference resolution and entity linking, temporal structure prediction, and discourse parsing. These tasks will be envisioned in both document and cross-document settings, although we expect to exploit inter-document links either way. The choice of these particular tasks is guided by the fact that they are still open problems for the NLP community, they potentially have a high impact for industrial applications (like information retrieval, question answering, etc.), and we already have some expertise on these tasks in the team. As a midterm goal, we also plan to work on tasks more directly related to micro-blogging, such as sentiment analysis and the automatic thread structuring of technical forums; the latter task is in fact an instance of rhetorical structure prediction. We have already initiated some work on coreference resolution with graph-based learning, by casting the problem as an instance of spectral clustering.

In most applications, edge weights are computed through a complex data modeling process and convey crucially important information for classifying nodes, making it possible to infer properties of each data sample by exploiting the graph topology alone. In fact, a widespread approach to several classification problems is to represent the data through an undirected weighted graph in which edge weights quantify the similarity between data points. This technique for coding input data has been applied to several domains, including classification of genomic data, face recognition, and text categorization.

In some cases, the full adjacency matrix is generated by employing suitable similarity functions chosen through a deep understanding of the problem structure. For example, for the TF-IDF representation of documents, the affinity between pairs of samples is often estimated through the cosine similarity measure.
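As an illustration of this graph construction step, the following sketch (our own, with hypothetical data and function names) builds a cosine affinity matrix from TF-IDF-like row vectors and then sparsifies it into a symmetrized k-nearest-neighbor graph:

```python
import numpy as np

def cosine_affinity(X):
    """Dense cosine-similarity matrix for row-wise feature vectors X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Xn = X / norms
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)        # no self-loops
    return S

def knn_graph(S, k):
    """Sparsify an affinity matrix: keep the k strongest edges per node, symmetrized."""
    n = S.shape[0]
    W = np.zeros_like(S)
    for i in range(n):
        nearest = np.argsort(S[i])[-k:]
        W[i, nearest] = S[i, nearest]
    return np.maximum(W, W.T)       # keep an edge if either endpoint selected it

# Three tiny "documents" as raw term-weight vectors (stand-ins for TF-IDF rows).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
W = knn_graph(cosine_affinity(X), k=1)
```

The choice of k and of the similarity function are precisely the design decisions that adaptive graph construction, discussed next, aims to learn rather than fix by hand.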

In this project we will address the problem of adaptive graph construction along several directions. The first one concerns how to choose the best similarity measure given the objective learning task. This question is related to the question of metric and similarity learning, which has not been considered in the context of graph-based learning. In the context of structured prediction, we will develop approaches where output structures are organized in graphs endowed with a suitable similarity between structures.

A different way we envision adaptive graph construction is in the context of semi-supervised learning. Partial supervision can take various forms, and an interesting and original setting is governed by two currently studied applications: detection of brain anomalies from connectome data and poll recommendation in marketing. Indeed, for these two applications, a partial knowledge of the information diffusion process can be observed while the network is unknown or only partially known. An objective is to construct (or complete) the network structure from some local diffusion information. The problem can be formalized as a graph construction problem from partially observed diffusion processes, and has only very recently begun to be studied. In our case, the originality comes either from the existence of different sources of observations or from the large impact of node contents in the network.
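As a toy illustration of recovering edges from diffusion traces, the sketch below uses a simple co-activation heuristic of our own (not the method studied in the literature we build on): a candidate edge is scored by how often one node activates shortly after the other across observed cascades.

```python
from collections import Counter
from itertools import combinations

def infer_edges(cascades, max_lag=1, min_count=2):
    """Score candidate directed edges (i -> j) by how often j activates
    within `max_lag` time steps after i across observed cascades.

    cascades: list of dicts mapping node -> activation time.
    Returns the edges whose co-activation count reaches `min_count`.
    This is a co-occurrence heuristic, not a full likelihood model.
    """
    counts = Counter()
    for times in cascades:
        for i, j in combinations(times, 2):
            if 0 < times[j] - times[i] <= max_lag:
                counts[(i, j)] += 1
            elif 0 < times[i] - times[j] <= max_lag:
                counts[(j, i)] += 1
    return {e for e, c in counts.items() if c >= min_count}

# Three toy cascades over nodes a, b, c: b tends to fire right after a.
cascades = [{"a": 0, "b": 1, "c": 5},
            {"a": 2, "b": 3},
            {"a": 0, "b": 1, "c": 1}]
edges = infer_edges(cascades)
```

Real network inference methods replace this counting heuristic with a probabilistic diffusion model, but the input/output contract is the same: cascades in, a (partial) graph out.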

We will study how to combine graphs defined by networked data and graphs built from flat data to solve a given task. This is of major importance for information networks because, as said above, we will have to deal with multiple relations between entities (texts, spans of texts, ...) and also use textual data and vectorial data.

As stated in the previous sections, graphs as complex objects provide a rich representation of data. Often enough the data is only partially available and the graph representation is very helpful in predicting the unobserved elements. We are interested in problems where the complete structure of the graph needs to be recovered and only a fraction of the links is observed. The link prediction problem falls into this category. We are also interested in the recommendation and link classification problems which can be seen as graphs where the structure is complete but some labels on the links (weights or signs) are missing. Finally we are also interested in labeling the nodes of the graph, with class or cluster memberships or with a real value, provided that we have (some information about) the labels for some of the nodes.

The semi-supervised framework will also be considered. A midterm research plan is to study how graph regularization models help with structured prediction problems. This question will be studied in the context of NLP tasks, as noted above, but we also plan to develop original machine learning algorithms that have a more general applicability. Inputs are networks whose nodes (texts) have to be labeled with structures. We assume that structures lie on some manifold, and we want to study how labels can propagate in the network. One approach is to find a smooth labeling function corresponding to a harmonic function on both the input and output manifolds.

Scalability is one of the main issues in the design of new prediction algorithms working on networked data. It has gained more and more importance in recent years because of the growing size of the most popular networked data, which are now used by millions of people. In such contexts, learning algorithms whose computational complexity scales quadratically or worse in the number of considered data objects (usually nodes or edges, depending on the task) should be considered impractical.

These observations lead to the idea of using graph sparsification techniques in order to work on a part of the original network and obtain results that can be easily extended and used for the whole original input. A sparsified version of the original graph can often be seen as a subset of the initial input, i.e. a suitably selected input subgraph which forms the training set (or, more generally, is included in the training set). This holds even in the active setting. A simple example is to find a spanning tree of the input graph, possibly using randomization techniques, with properties that allow us to obtain interesting results for the initial graph dataset. We have already started to explore this research direction.
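A minimal sketch of the spanning-tree idea (our own illustration, with hypothetical names): running Kruskal's algorithm over randomly shuffled edges yields a random spanning tree that can serve as a sparsified training input.

```python
import random

def random_spanning_tree(n, edges, seed=0):
    """Random spanning tree via Kruskal on randomly ordered edges.

    n: number of nodes (0..n-1); edges: list of (u, v) pairs.
    Note: this yields *a* random spanning tree, but not one drawn uniformly
    at random; exact uniform sampling would use e.g. Wilson's random-walk
    algorithm instead.
    """
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    tree = []
    shuffled = edges[:]
    rng.shuffle(shuffled)
    for u, v in shuffled:
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding the edge creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Complete graph on 5 nodes: any spanning tree has exactly 4 edges.
edges = [(u, v) for u in range(5) for v in range(u + 1, 5)]
tree = random_spanning_tree(5, edges)
```

The tree has linearly many edges regardless of the density of the input graph, which is what makes tree-based predictors attractive at scale.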

At the level of mathematical foundations, the key issue to be addressed in the study of (large-scale) random networks concerns the segmentation of network data into sets of independent and identically distributed observations. If we identify the data sample with the whole network, as has been done in previous approaches, we typically end up with a set of observations (such as nodes or edges) which are highly interdependent and hence strongly violate the classic i.i.d. assumption. In this case, the data scale can be so large and the range of correlations so wide that the cost of taking into account the whole data and their dependencies is typically prohibitive. If we focus instead on a set of subgraphs independently drawn from a (virtually infinite) target network, we come up with a set of independent and identically distributed observations—namely the subgraphs themselves, where subgraph sampling is the underlying ergodic process. Such an approach is one principled direction for giving novel statistical foundations to random network modeling. At the same time, because the focus shifts from the whole network to a set of subgraphs, complexity issues can be restricted to the number of subgraphs and their size. The latter quantities can be controlled much more easily than the overall network size and dependence relationships, thus allowing us to tackle scalability challenges through a radically redesigned approach.
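The subgraph-sampling viewpoint can be sketched as follows (a toy illustration of our own, with hypothetical names): each sampled ego-net is treated as one observation, so statistics computed per ego-net become i.i.d. estimates to which standard concentration results apply.

```python
import random

def sample_egonets(adj, k, seed=0):
    """Draw k independent radius-1 ego subgraphs from a graph.

    adj: dict node -> set of neighbors. Each sampled ego-net is one
    observation; per-subgraph statistics can then be averaged with the
    usual i.i.d. guarantees.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    samples = []
    for _ in range(k):
        center = rng.choice(nodes)
        ego = {center} | adj[center]
        # induced edge set of the ego-net
        edges = {(u, v) for u in ego for v in adj[u] & ego if u < v}
        samples.append((ego, edges))
    return samples

# Star graph: node 0 linked to nodes 1..4.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
samples = sample_egonets(adj, k=3)
densities = [len(e) / len(n) for n, e in samples]
```

The cost of the analysis now depends on the number and radius of sampled subgraphs, not on the size of the full network.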

Another way to tackle scalability problems is to exploit the inherently decentralized nature of very large graphs. Indeed, in many situations very large graphs are the abstract view of the digital activities of a very large set of users equipped with their own devices. Nowadays, smartphones, tablets and even sensors have storage and computation power and gather a lot of data that serve analytics, prediction and personalized recommendation. Gathering all user data in large data centers is costly because it requires oversized infrastructures with huge energy consumption and large-bandwidth networks. Even though cloud architectures can optimize such infrastructures, data concentration is also prone to security leaks, loss of privacy and loss of data governance for end users. The alternative we have started to develop in Magnet is to devise decentralized, private and personalized machine learning algorithms that can be deployed on personal devices. The key challenges are therefore to learn in a collaborative way in a network of learners and to preserve privacy and control over personal data.

In many cases, algorithms for solving node classification problems are driven by the following assumption: linked entities tend to be assigned to the same class. This assumption, in the context of social networks, is known as homophily and involves ties of every type, including friendship, work, marriage, age, gender, and so on. In social networks, homophily naturally implies that a set of individuals can be partitioned into subpopulations that are more cohesive. In fact, the presence of homogeneous groups sharing common interests is a key reason for affinity among interconnected individuals, which suggests that, in spite of its simplicity, this principle turns out to be very powerful for node classification problems in general networks.
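The homophily principle translates directly into the simplest possible node classifier, sketched below (our own toy illustration, not one of the algorithms developed in the project): an unlabeled node takes the majority label of its already-labeled neighbors.

```python
from collections import Counter

def majority_vote(adj, seeds, n_rounds=5):
    """Classify unlabeled nodes by iteratively taking the majority label
    of their labeled neighbors, the most basic homophily-based predictor.

    adj: dict node -> set of neighbors; seeds: dict with the known labels.
    """
    labels = dict(seeds)
    for _ in range(n_rounds):
        for node in adj:
            if node in labels:
                continue
            votes = Counter(labels[v] for v in adj[node] if v in labels)
            if votes:
                labels[node] = votes.most_common(1)[0][0]
    return labels

# Two triangles joined by one edge; seeds: node 0 is "A", nodes 4 and 5 are "B".
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = majority_vote(adj, {0: "A", 4: "B", 5: "B"})
```

Each triangle ends up uniformly labeled, which is exactly the cohesive-subpopulation behavior homophily predicts; the signed-graph setting below is precisely where this assumption breaks down.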

Recently, however, researchers have started to consider networked data where connections may also carry a negative meaning, for instance disapproval or distrust in social networks, or negative endorsements on the Web. Although the introduction of signs on graph edges appears to be a small change from standard weighted graphs, the resulting mathematical model, called signed graphs, has an unexpectedly rich additional complexity. For example, their spectral properties, on which essentially all sophisticated node classification algorithms rely, are different from and less well understood than those of unsigned graphs. Signed graphs naturally lead to a specific inference problem that we have discussed in previous sections: link classification, i.e. the problem of predicting the signs of links in a given graph. In online social networks, this may be viewed as a form of sentiment analysis, since we would like to semantically categorize the relationships between individuals.
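One classical heuristic for link classification in signed graphs is social balance theory; the sketch below (our own illustration, not one of the algorithms discussed here) predicts a missing sign from two-edge paths: in a balanced triangle the product of the three signs is +1, so each common neighbor votes with the product of its two incident signs.

```python
def predict_sign(signs, u, v):
    """Predict the sign of edge (u, v) from social balance theory.

    signs: dict mapping frozenset({a, b}) -> +1 or -1 for known edges.
    Each common neighbor w votes sign(u, w) * sign(w, v).
    Returns +1, -1, or 0 when there is no evidence either way.
    """
    nodes = set()
    for e in signs:
        nodes |= e
    vote = 0
    for w in nodes - {u, v}:
        e1, e2 = frozenset({u, w}), frozenset({w, v})
        if e1 in signs and e2 in signs:
            vote += signs[e1] * signs[e2]
    return (vote > 0) - (vote < 0)

# "The enemy of my enemy is my friend": u-w and w-v are both negative.
signs = {frozenset({"u", "w"}): -1, frozenset({"w", "v"}): -1}
pred = predict_sign(signs, "u", "v")
```

Balance-based voting is a useful baseline; the spectral approaches mentioned above can be seen as global, relaxation-based generalizations of this local rule.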

Our main targeted applications are browsing, monitoring, recommending and mining in information networks. The learning tasks considered in the project such as node clustering, node and link classification and link prediction are likely to yield important improvements in these applications. Application domains cover social networks for cultural data and e-commerce, and biomedical informatics.

We also target applications related to decentralized learning and privacy preserving systems when users or devices are interconnected in large networks. We develop solutions based on urban and mobility data where privacy is a specific requirement.

Strengthening of the privacy-aware machine learning activity through a new associate team with the Alan Turing Institute and the organization of a workshop at NeurIPS (formerly NIPS).

New collaboration with Multispeech (Inria Nancy) on decentralized and private machine learning for speech processing leading to an ANR and an H2020 project.

Aurélien Bellet received a best reviewer award (top 200 out of 3000) at the conference NeurIPS 2018. Pascal Denis received a Distinguished Senior Program Committee award at IJCAI-ECAI 2018.

*Python library for noun phrase COreference Resolution in natural language TEXts*

Keyword: Natural language processing

Functional Description: CoRTex is an LGPL-licensed Python library for noun phrase coreference resolution in natural language texts. This library contains implementations of various state-of-the-art coreference resolution algorithms, including those developed in our research. In addition, it provides a set of APIs and utilities for text pre-processing, reading the CONLL2012 and CONLLU annotation formats, and performing evaluation, notably based on the main evaluation metrics (MUC, B-CUBED, and CEAF). As such, CoRTex provides benchmarks for researchers working on coreference resolution, but it is also of interest for developers who want to integrate a coreference resolution component within a larger platform. It currently supports English and French.

Participant: Pascal Denis

Partner: Orange Labs

Contact: Pascal Denis

*MAgnet liNGuistic wOrd vEctorS*

Keywords: Word embeddings - NLP

Functional Description: Process textual data and compute vocabularies and co-occurrence matrices. Input data should be raw text or annotated text. Compute word embeddings with different state-of-the-art unsupervised methods. Propose statistical and intrinsic evaluation methods, as well as some visualization tools.

Contact: Nathalie Vauquier

Keywords: Machine learning - Python - Metric learning

Functional Description: Distance metrics are widely used in the machine learning literature. Traditionally, practitioners would choose a standard distance metric (Euclidean, City-Block, Cosine, etc.) using a priori knowledge of the domain. Distance metric learning (or simply, metric learning) is the sub-field of machine learning dedicated to automatically constructing optimal distance metrics.

This package contains efficient Python implementations of several popular metric learning algorithms.

Partner: Parietal

Contact: William De Vazelhes

Keywords: Privacy - Machine learning - Statistics

Functional Description: Decentralized algorithms for machine learning and inference tasks which (1) perform as much computation as possible locally and (2) ensure privacy and security by ensuring that personal data never leave users' devices.

Contact: Nathalie Vauquier

We consider extensions of Hoeffding's “exponential method” approach for obtaining upper estimates on the probability that a sum of independent and bounded random variables is significantly larger than its mean. We show that the exponential function in Hoeffding's approach can be replaced with any function which is non-negative, increasing and convex. As a result, we generalize and improve upon Hoeffding's inequality. Our approach allows us to obtain “missing factors” in Hoeffding's inequality. The latter result is a somewhat weaker version of a theorem due to Michel Talagrand. Moreover, we characterize the class of functions with respect to which our method yields optimal concentration bounds. Finally, using ideas from the theory of Bernstein polynomials, we show that similar ideas apply under information on higher moments of the random variables.
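As a quick numerical sanity check of the classical inequality this work improves upon (our own illustration, not the paper's method), one can compare the Hoeffding bound exp(-2nt^2) for [0,1]-valued variables with a Monte-Carlo estimate of the actual tail probability:

```python
import math
import random

def hoeffding_bound(n, t):
    """Hoeffding: P(mean - E[mean] >= t) <= exp(-2 n t^2) for [0,1]-valued i.i.d. variables."""
    return math.exp(-2 * n * t * t)

def empirical_tail(n, t, trials=2000, seed=0):
    """Monte-Carlo estimate of the same tail probability for uniform(0,1) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if mean - 0.5 >= t:       # E[mean] = 0.5 for uniform(0,1)
            hits += 1
    return hits / trials

n, t = 50, 0.1
emp, bound = empirical_tail(n, t), hoeffding_bound(n, t)
```

The gap between the empirical tail and the bound is substantial, which is precisely the "missing factor" that sharper exponential-method bounds aim to recover.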

Graphlets are small network patterns that can be counted in order to characterise the structure of a network (topology). As part of a topology optimisation process, one could use graphlet counts to iteratively modify a network and keep track of the graphlet counts, in order to achieve certain topological properties. Up until now, however, graphlets were not suited as a metric for performing topology optimisation; when millions of minor changes are made to the network structure it becomes computationally intractable to recalculate all the graphlet counts for each of the edge modifications. We propose IncGraph, a method for calculating the differences in graphlet counts with respect to the network in its previous state, which is much more efficient than calculating the graphlet occurrences from scratch at every edge modification made. In comparison to static counting approaches, our findings show IncGraph reduces the execution time by several orders of magnitude. The usefulness of this approach was demonstrated by developing a graphlet-based metric to optimise gene regulatory networks. IncGraph is able to quickly quantify the topological impact of small changes to a network, which opens novel research opportunities to study changes in topologies in evolving or online networks, or develop graphlet-based criteria for topology optimisation. IncGraph is freely available as an open-source R package on CRAN (incgraph). The development version is also available on GitHub (rcannood/incgraph).
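The incremental-counting idea can be illustrated on the simplest graphlet, the triangle (a toy sketch of our own, not the IncGraph implementation): when an edge is inserted, the global count changes by exactly the number of common neighbors of its endpoints, so there is no need to recount from scratch.

```python
def add_edge(adj, tri_count, u, v):
    """Add edge (u, v) and update the global triangle count incrementally.

    Each common neighbor of u and v closes exactly one new triangle, so
    the delta is |N(u) ∩ N(v)|.
    """
    delta = len(adj.setdefault(u, set()) & adj.setdefault(v, set()))
    adj[u].add(v)
    adj[v].add(u)
    return tri_count + delta

adj, triangles = {}, 0
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]:
    triangles = add_edge(adj, triangles, u, v)
```

Each update costs time proportional to the smaller neighborhood, versus a full recount over all node triples, which is the kind of speedup IncGraph generalizes to all small graphlets.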

Counting the number of times a pattern occurs in a database is a fundamental data mining problem. It is a subroutine in a diverse set of tasks ranging from pattern mining to supervised learning and probabilistic model learning. While a pattern and a database can take many forms, this paper focuses on the case where both the pattern and the database are graphs (networks). Unfortunately, in general, the problem of counting graph occurrences is #P-complete. In contrast to earlier work, which focused on exact counting for simple (i.e., very short) patterns, we present a sampling approach for estimating the statistics of larger graph pattern occurrences. We perform an empirical evaluation on synthetic and real-world data that validates the proposed algorithm, illustrates its practical behavior and provides insight into the trade-off between its accuracy of estimation and computational efficiency.
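The sampling idea can be illustrated on triangle counting (a toy sketch of our own, much simpler than the paper's algorithm): sample node triples uniformly and extrapolate the fraction that form the pattern.

```python
import random

def estimate_triangles(adj, n_samples=5000, seed=0):
    """Estimate the number of triangles by uniformly sampling node triples
    and extrapolating the fraction that are fully connected."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    n = len(nodes)
    total_triples = n * (n - 1) * (n - 2) // 6
    hits = 0
    for _ in range(n_samples):
        a, b, c = rng.sample(nodes, 3)
        if b in adj[a] and c in adj[a] and c in adj[b]:
            hits += 1
    return total_triples * hits / n_samples

# K4 minus one edge has exactly 2 triangles.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
est = estimate_triangles(adj)
```

The running time depends on the number of samples rather than on the #P-hard exact count, at the cost of an estimation error that shrinks with the sample size.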

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-LEARNER's predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

We consider the problem of learning a high-dimensional but low-rank matrix from a large-scale dataset distributed over several machines, where low-rankness is enforced by a convex trace norm constraint. We propose DFW-Trace, a distributed Frank-Wolfe algorithm which leverages the low-rank structure of its updates to achieve efficiency in time, memory and communication usage. The step at the heart of DFW-Trace is solved approximately using a distributed version of the power method. We provide a theoretical analysis of the convergence of DFW-Trace, showing that we can ensure sublinear convergence in expectation to an optimal solution with few power iterations per epoch. We implement DFW-Trace in the Apache Spark distributed programming framework and validate the usefulness of our approach on synthetic and real data, including the ImageNet dataset with high-dimensional features extracted from a deep neural network.
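The key property exploited by DFW-Trace can be seen in a single-machine sketch of Frank-Wolfe over the trace-norm ball (our own simplified illustration, not the distributed algorithm): each update only needs the top singular pair of the gradient, so iterates stay low-rank.

```python
import numpy as np

def fw_trace(M_obs, mask, tau, n_iter=200):
    """Frank-Wolfe for matrix completion under a trace-norm constraint.

    Minimizes 0.5 * ||mask * (X - M_obs)||_F^2 over ||X||_* <= tau.
    Each step only needs the top singular pair of the gradient (in
    practice computed by the power method), so iterates remain low-rank.
    """
    X = np.zeros_like(M_obs)
    for k in range(n_iter):
        G = mask * (X - M_obs)                 # gradient of the loss
        U, _, Vt = np.linalg.svd(-G)           # top pair of -G solves the linear subproblem
        S = tau * np.outer(U[:, 0], Vt[0])     # extreme point of the trace-norm ball
        gamma = 2.0 / (k + 2)                  # standard Frank-Wolfe step size
        X = (1 - gamma) * X + gamma * S
    return X

# Rank-1 ground truth with a few entries hidden by the mask.
M = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
mask = np.ones_like(M)
mask[0, 2] = mask[2, 0] = 0.0                  # unobserved entries
X = fw_trace(M, mask, tau=np.linalg.norm(M, "nuc"))
```

Because each iterate is a convex combination of rank-1 matrices, only small factors ever need to be stored or communicated, which is the source of the memory and communication savings in the distributed setting.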

The rise of connected personal devices together with privacy concerns call for machine learning algorithms capable of leveraging the data of a large number of agents to learn personalized models under strong privacy requirements. In this paper, we introduce an efficient algorithm to address the above problem in a fully decentralized (peer-to-peer) and asynchronous fashion, with provable convergence rate. We show how to make the algorithm differentially private to protect against the disclosure of information about the personal datasets, and formally analyze the trade-off between utility and privacy. Our experiments show that our approach dramatically outperforms previous work in the non-private case, and that under privacy constraints, we can significantly improve over models learned in isolation.

The amount of personal data collected in our everyday interactions with connected devices offers great opportunities for innovative services fueled by machine learning, as well as raises serious concerns for the privacy of individuals. In this paper, we propose a massively distributed protocol for a large set of users to privately compute averages over their joint data, which can then be used to learn predictive models. Our protocol can find a solution of arbitrary accuracy, does not rely on a third party and preserves the privacy of users throughout the execution in both the honest-but-curious and malicious adversary models. Specifically, we prove that the information observed by the adversary (the set of malicious users) does not significantly reduce the uncertainty in its prediction of private values compared to its prior belief. The level of privacy protection depends on a quantity related to the Laplacian matrix of the network graph and generally improves with the size of the graph. Furthermore, we design a verification procedure which offers protection against malicious users joining the service with the goal of manipulating the outcome of the algorithm.
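The averaging primitive at the heart of such protocols can be sketched as plain gossip (a much-simplified illustration of our own, without the noise-based privacy mechanisms of the actual protocol): pairs of neighbors repeatedly average their values, and everyone converges to the global mean without any central coordinator.

```python
def gossip_average(values, pairs):
    """Decentralized averaging: at each step, two neighbors replace their
    values by their pairwise average. Every exchange preserves the global
    sum, so the values converge to the network-wide mean without a
    central coordinator.
    """
    values = list(values)
    for i, j in pairs:
        m = (values[i] + values[j]) / 2.0
        values[i] = values[j] = m
    return values

# Four users on a ring; repeat the ring's edges for a few rounds.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
vals = gossip_average([4.0, 0.0, 8.0, 0.0], ring * 20)
```

Privacy-preserving variants perturb the exchanged values with correlated noise that cancels in the aggregate, so the same convergence argument applies while individual contributions stay hidden.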

Several recent studies have shown the benefits of combining language and perception to infer word embeddings. These multimodal approaches either simply combine pre-trained textual and visual representations (e.g. features extracted from convolutional neural networks), or use the latter to bias the learning of textual word embeddings. In this work, we propose a novel probabilistic model to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus. Our approach learns textual and visual representations jointly: latent visual factors couple together a skip-gram model for co-occurrence in linguistic data and a generative latent variable model for visual data. Extensive experimental studies validate the proposed model. Concretely, on the tasks of assessing pairwise word similarity and image/caption retrieval, our approach attains equally competitive or stronger results when compared to other state-of-the-art multimodal models.

We present a simple framework for characterizing morphological complexity and how it encodes syntactic information. In particular, we propose a new measure of morpho-syntactic complexity in terms of governor-dependent preferential attachment that explains parsing performance. Through experiments on dependency parsing with data from Universal Dependencies (UD), we show that representations derived from morphological attributes deliver important parsing performance improvements over standard word form embeddings when trained on the same datasets. We also show that the new morpho-syntactic complexity measure is predictive of the gains provided by using morphological attributes over plain forms on parsing scores, making it a tool to distinguish languages using morphology as a syntactic marker from others.

A reciprocal recommendation problem is one where the goal of learning is not just to predict a user's preference towards a passive item (e.g., a book), but to recommend to a target user on one side another user from the other side, such that a mutual interest exists between the two. The problem is thus sharply different from the more traditional items-to-users recommendation, since a good match requires meeting the preferences of both sides. We initiate a rigorous theoretical investigation of the reciprocal recommendation task in a specific framework of sequential learning. We point out general limitations, formulate reasonable assumptions enabling effective learning and, under these assumptions, we design and analyze a computationally efficient algorithm that uncovers mutual likes at a pace comparable to that achieved by a clairvoyant algorithm knowing all user preferences in advance. Finally, we validate our algorithm against synthetic and real-world datasets, showing improved empirical performance over simple baselines.

We consider the problem of clustering a finite set of items from pairwise similarity information. Unlike what is done in the literature on this subject, we do so in a passive learning setting, and with no specific constraints on the cluster shapes other than their size. We investigate the problem in different settings: i. an online setting, where we provide a tight characterization of the prediction complexity in the mistake bound model, and ii. a standard stochastic batch setting, where we give tight upper and lower bounds on the achievable generalization error. Prediction performance is measured both in terms of the ability to recover the similarity function encoding the hidden clustering and in terms of how well we classify each item within the set. The proposed algorithms are time efficient.

The performance of many machine learning techniques depends on the choice of an appropriate similarity or distance measure on the input space. Similarity learning (or metric learning) aims at building such a measure from training data so that observations with the same (resp. different) label are as close (resp. far) as possible. In this paper, similarity learning is investigated from the perspective of pairwise bipartite ranking, where the goal is to rank the elements of a database by decreasing order of the probability that they share the same label with some query data point, based on the similarity scores. A natural performance criterion in this setting is pointwise ROC optimization: maximize the true positive rate under a fixed false positive rate. We study this novel perspective on similarity learning through a rigorous probabilistic framework. The empirical version of the problem gives rise to a constrained optimization formulation involving U-statistics, for which we derive universal learning rates as well as faster rates under a noise assumption on the data distribution. We also address the large-scale setting by analyzing the effect of sampling-based approximations. Our theoretical results are supported by illustrative numerical experiments.
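The pointwise ROC criterion can be made concrete on synthetic scores: threshold the similarity at the (1 - α)-quantile of the negative-pair scores, then report the fraction of positive pairs above that threshold. A minimal illustration on synthetic data (not the paper's experiments; scores and parameters are made up):

```python
import numpy as np

def tpr_at_fpr(pos_scores, neg_scores, target_fpr=0.1):
    """Pointwise ROC criterion: choose the score threshold that achieves
    the target false positive rate on negative (different-label) pairs,
    and report the true positive rate on positive (same-label) pairs."""
    threshold = np.quantile(neg_scores, 1.0 - target_fpr)
    return float(np.mean(pos_scores > threshold))

# Toy similarity scores: same-label pairs tend to score higher
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.5, 1000)  # similarity scores of positive pairs
neg = rng.normal(0.0, 0.5, 1000)  # similarity scores of negative pairs
rate = tpr_at_fpr(pos, neg, target_fpr=0.05)
```

A better similarity measure shifts the positive-pair scores upward and raises this rate; the paper studies how to optimize it over a class of similarity functions.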

Similarity and metric learning provides a principled approach to construct a task-specific similarity from weakly supervised data. However, these methods are subject to the curse of dimensionality: as the number of features grows large, poor generalization is to be expected and training becomes intractable due to high computational and memory costs. In this paper, we propose a similarity learning method that can efficiently deal with high-dimensional sparse data. This is achieved through a parameterization of similarity functions by convex combinations of sparse rank-one matrices, together with the use of a greedy approximate Frank-Wolfe algorithm which provides an efficient way to control the number of active features. We show that the convergence rate of the algorithm, as well as its time and memory complexity, are independent of the data dimension. We further provide a theoretical justification of our modeling choices through an analysis of the generalization error, which depends logarithmically on the sparsity of the solution rather than on the number of features. Our experiments on datasets with up to one million features demonstrate the ability of our approach to generalize well despite the high dimensionality as well as its superiority compared to several competing methods.
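The key algorithmic idea, Frank-Wolfe over convex combinations of sparse rank-one atoms so that each iteration activates at most one new entry of the similarity matrix, can be sketched as follows. This is a simplified dense toy version with a squared loss, not the paper's method: the paper works with sparse high-dimensional data and a different loss, and the pair format here is assumed.

```python
import numpy as np

def fw_similarity(pairs, dim, beta=3.0, iters=300):
    """Frank-Wolfe over the convex hull of scaled rank-one atoms
    {+/- beta * e_i e_j^T}: each step moves toward a single-entry matrix,
    so the iterate after t steps has at most t nonzero entries."""
    M = np.zeros((dim, dim))
    for t in range(iters):
        # Gradient of the squared loss over (x, x', y) training pairs
        G = np.zeros((dim, dim))
        for x, xp, y in pairs:
            G += 2.0 * (x @ M @ xp - y) * np.outer(x, xp)
        # Linear minimization oracle: the single best entry to move toward
        i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
        atom = np.zeros((dim, dim))
        atom[i, j] = -np.sign(G[i, j]) * beta
        gamma = 2.0 / (t + 2.0)  # standard Frank-Wolfe step size
        M = (1 - gamma) * M + gamma * atom
    return M

def pair_loss(M, pairs):
    return sum((x @ M @ xp - y) ** 2 for x, xp, y in pairs)

# Toy data: the target similarity uses only the (0, 0) feature interaction
rng = np.random.default_rng(0)
pairs = [(x, xp, 3.0 * x[0] * xp[0] + 0.1 * rng.normal())
         for x, xp in ((rng.normal(size=10), rng.normal(size=10))
                       for _ in range(50))]
M = fw_similarity(pairs, dim=10)
```

Because each atom touches a single coordinate pair, the number of active features is controlled directly by the number of iterations, which is the mechanism behind the dimension-independent complexity claimed in the abstract.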

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
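The feedback model can be made concrete: the loss of the action played at round t is split, adversarially, into d nonnegative components charged to rounds t through t+d-1, and the player observes only the per-round sum. The following toy simulation of the observation model is illustrative only (random rather than adversarial splits), and does not implement the paper's reduction:

```python
import random

def composite_feedback(losses, d):
    """Spread the loss incurred at each round over the next d rounds and
    return, per round, only the sum of all components landing there.
    With d = 1 this is the standard bandit observation."""
    T = len(losses)
    observed = [0.0] * T
    for t, loss in enumerate(losses):
        # split this round's loss into d nonnegative components
        cuts = sorted(random.uniform(0, loss) for _ in range(d - 1))
        parts = [b - a for a, b in zip([0.0] + cuts, cuts + [loss])]
        for s, part in enumerate(parts):
            if t + s < T:
                observed[t + s] += part
    return observed
```

The observation at a given round thus mixes components of up to d past actions, which is why the individual delayed losses cannot be recovered, only their sum.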

In the course of a collaboration with Orange, we developed a Natural Language Processing library for coreference resolution. The library builds on previous work (CorTeX) and was extended in several ways: it handles the French language, it includes new features based on vectorial representations of words (word embeddings), and it is more scalable. Pascal Denis is the local PI at Inria for this project.

Jan Ramon is the local PI at Inria for the ADEME-MUST project (Méthodologie d'exploitation des données d'usage des véhicules et d'identification de nouveaux services pour les usagers et les territoires). We study machine learning and data mining methods for knowledge discovery from mobility data, which are time-stamped signals collected from cars, such as GPS locations, accelerations and fuel consumption. We aim to discover knowledge that helps address important questions in the transportation system, such as road safety, traffic congestion, parking, ride-sharing, pollution and energy consumption. As mobility data contain a lot of personal information, for instance the driving styles and locations of the users, we also study methods that allow users to keep their personal data and exchange only part of it to collaboratively derive the knowledge.

The project has four partners: the Xee company, CEREMA, i-Trans and Inria. Xee is responsible for recruiting drivers and collecting the data. CEREMA and i-Trans act as domain experts who help us formulate the questions and verify the analytical results. Magnet is responsible for developing and applying data mining methods to analyze the data. The methods developed and the knowledge discovered in the project will be transferred to Metropole Lille and ADEME.

Claim assistance is a French company that develops assistance services for conflict resolution. Its main service is RefundMyTicket.

We conducted research in collaboration with J. Senechal from the department of law at Lille University. We are interested in studying the impact of technological choices regarding computation models from the perspective of the GDPR.

We strengthened our partnership with the linguistics laboratory STL at Lille University. We welcomed Bert Cappelle for a stay (délégation) in the group. The topic of this collaboration was the study of modal verbs and of how the notion of compositionality translates to vectorial representations of words.

We initiated a collaboration with cognitive scientists (Angèle Brunellière and Jérémie Jozefowiez) from the psychology department, which resulted in a submission to a multidisciplinary Huma-Num project, to be funded by the Réseau National des Maisons des Sciences de l'Homme (RNMSH).

We started working with Christopher Fletcher (CNRS) from the History department.

These collaborations heavily rely on our work on distributional semantics and word embeddings to provide new insights into these different fields, hence also on the Mangoes toolkit developed in the team.

We participate in the *Advanced data science and technologies* project (CPER Data). This project is organized along three axes: internet of things, data science, and high performance computing. Magnet is involved in the data science axis to develop machine learning algorithms for big data, structured data and heterogeneous data. The MyLocalInfo project is an open API for privacy-friendly collaborative computing in the internet of things.

**Participants**: Marc Tommasi [correspondent], Aurélien Bellet, Rémi Gilleron, Jan Ramon, Mahsa Asadi

The Pamela project aims at developing machine learning theories and algorithms in order to learn local and personalized models from data distributed over networked infrastructures. Our project seeks to provide first answers to the challenges raised by modern information systems built by interconnecting many personal devices holding private user data, in search of personalized suggestions and recommendations. More precisely, we will focus on learning in a collaborative way with the help of neighbors in a network. We aim to lay the first blocks of a scientific foundation for these new types of systems, in effect moving from graphs of data to graphs of data and learned models. We argue that this shift is necessary in order to address the new constraints arising from the decentralization of information that is inherent to the emergence of big data. We will in particular focus on the question of learning under communication and privacy constraints. A significant asset of the project is the quality of its industrial partners, Snips and Mediego, who bring in their expertise in privacy protection and distributed computing as well as use cases and datasets. They will contribute to translate this fundamental research effort into concrete outcomes by developing personalized and privacy-aware assistants able to provide contextualized recommendations on small devices and smartphones.

**Participants**: Pascal Denis [correspondent], Aurélien Bellet, Rémi Gilleron, Mikaela Keller, Marc Tommasi

The GRASP project aims at designing new graph-based Machine Learning algorithms that are better tailored to Natural Language Processing structured output problems. Focusing on semi-supervised learning scenarios, we will extend current graph-based learning approaches along two main directions: (i) the use of structured outputs during inference, and (ii) a graph construction mechanism that is more dependent on the task objective and more closely related to label inference. Combined, these two research strands will provide an important step towards delivering more adaptive (to new domains and languages), more accurate, and ultimately more useful language technologies. We will target semantic and pragmatic tasks such as coreference resolution, temporal chronology prediction, and discourse parsing for which proper Machine Learning solutions are still lacking.

**Participants**: Marc Tommasi [correspondent], Aurélien Bellet, Pascal Denis, Jan Ramon, Brij Srivastava

DEEP-PRIVACY proposes a new paradigm based on a distributed, personalized, and privacy-preserving approach for speech processing, with a focus on machine learning algorithms for speech recognition. To this end, we propose to rely on a hybrid approach: the device of each user does not share its raw speech data and runs some private computations locally, while some cross-user computations are done by communicating through a server (or a peer-to-peer network). To satisfy privacy requirements at the acoustic level, the information communicated to the server should not expose sensitive speaker information.

**Participants**: Pascal Denis [correspondent], Bo Li

With colleagues from the linguistics departments at Lille 3 and Neuchâtel (Switzerland), Pascal Denis is a member of another ANR project (REM), funded through the bilateral ANR-SNF scheme. This project, co-headed by I. Depraetere (Lille 3) and M. Hilpert (Neuchâtel), proposes to reconsider the analysis of English modal constructions from a multidisciplinary perspective, combining insights from theoretical, psycho-linguistic, and computational approaches.

Pascal Denis is an associate member of the Laboratoire d'Excellence *Empirical Foundations of Linguistics* (EFL).

Program: H2020 ICT-29-2018 (RIA)

Project acronym: COMPRISE

Project title: Cost-effective, Multilingual, Privacy-driven voice-enabled Services

Duration: Dec 2018- Nov 2021

Coordinator: Emmanuel Vincent

Other partners: Inria Multispeech, Ascora GmbH, Netfective Technology SA, Rooter Analysis SL, Tilde SIA, University of Saarland

Participants: Aurélien Bellet, Marc Tommasi, Brij Srivastava

Abstract: COMPRISE will define a fully private-by-design methodology and tools that will reduce the cost and increase the inclusiveness of voice interaction technologies.

Program: COST Action

Project acronym: TextLink

Project title: Structuring Discourse in Multilingual Europe

Duration: Apr. 2014 - Apr. 2018

Coordinator: Prof. Liesbeth Degand, Université Catholique de Louvain, Belgium. Pascal Denis is member of the Tools group.

Other partners: 26 EU countries and 3 international partner countries (Argentina, Brazil, Canada)

The Action will facilitate European multilingualism by (1) identifying and creating a portal into such resources within Europe - including annotation tools, search tools, and discourse-annotated corpora; (2) delineating the dimensions and properties of discourse annotation across corpora; (3) organizing these properties into a sharable taxonomy; (4) encouraging the use of this taxonomy in subsequent discourse annotation and in cross-lingual search and studies of devices that relate and structure discourse; and (5) promoting use of the portal, its resources and sharable taxonomy. TextLink will enhance the experience and performance of human translators, lexicographers, language technology and language learners alike.

**Inria@SiliconValley**

Associate Team involved in the International Lab:

Title: LEarning GOod representations for natural language processing

International Partner (Institution - Laboratory - Researcher):

USC (United States), Prof. Fei Sha.

Start year: 2016


LEGO lies in the intersection of Machine Learning and Natural Language Processing (NLP). Its goal is to address the following challenges: what are the right representations for structured data and how to learn them automatically, and how to apply such representations to complex and structured prediction tasks in NLP? In recent years, continuous vectorial embeddings learned from massive unannotated corpora have been increasingly popular, but they remain far too limited to capture the complexity of text data as they are task-agnostic and fall short of modeling complex structures in languages. LEGO strongly relies on the complementary expertise of the two partners in areas such as representation/similarity learning, structured prediction, graph-based learning, and statistical NLP to offer a novel alternative to existing techniques. Specifically, we will investigate the following three research directions: (a) optimize the embeddings based on annotations so as to minimize structured prediction errors, (b) generate embeddings from rich language contexts represented as graphs, and (c) automatically adapt the context graph to the task/dataset of interest by learning a similarity between nodes to appropriately weigh the edges of the graph. By exploring these complementary research strands, we intend to push the state-of-the-art in several core NLP problems, such as dependency parsing, coreference resolution and discourse parsing.

North-European Associate Team PAD-ML: Privacy-Aware Distributed Machine Learning.

International Partner: the PPDA team at the Alan Turing Institute.

Start year: 2018

In the context of increasing legislation on data protection (e.g., the recent GDPR), an important challenge is to develop privacy-preserving algorithms to learn from datasets distributed across multiple data owners who do not want to share their data. The goal of this joint team is to devise novel privacy-preserving, distributed machine learning algorithms and to assess their performance and guarantees in both theoretical and practical terms.

Tejas Kulkarni (University of Warwick) visited the team from May to August 2018 to work with Aurélien Bellet, Marc Tommasi and Jan Ramon on privacy-preserving computation.

Larisa Soldatova (Brunel University) visited the team in June 2018 to work with Jan Ramon on probabilistic reasoning for biomedical applications.

Raouf Kerkouche (Inria Privatics) visited the team for 2 weeks in July 2018 to work with Aurélien Bellet and Marc Tommasi on federated and decentralized learning from medical data.

Guillaume Rabusseau (Université de Montréal) visited the team for 1 week in July 2018 to work with Aurélien Bellet and Marc Tommasi on multi-task distributed spectral learning.

Daphne Ezer, Adrià Gascón, Matt Kusner, Brooks Paige (all from the Alan Turing Institute) and Hamed Haddadi (Imperial College London) visited the team for 2 days in October 2018 for the kick-off of the PAD-ML associate team.

Several international researchers have also been invited to give a talk at the MAGNET seminar:

D. Hovy (Bocconi Univ.): Retrofit Everything: Injecting External Knowledge into Neural Networks to Gain Insights from Big Data.

A. Trask (OpenMined): OpenMined - Building Tools for Safe AI.

C. Biemann (Univ. Hamburg): Adaptive Interpretable Language Technology.

W. Daelemans (Univ. Antwerp): Profiling authors from social media texts.

Igor Axinti explored several ways to compare word embeddings and studied the minimal corpus size for the comparison to be meaningful. He applied some of his findings to comparing two corpora in Middle French from the 15th century, one originating from London and the other from Flanders. He produced a querying interface that allows Christopher Fletcher (IRHiS), who provided the data, to explore and compare the embedding spaces.

Nicolas Crosetti (joint internship with Joachim Niehren and Florent Cappelli, Links) worked on dependency-weighted aggregation, i.e., aggregation where the elements to aggregate are weighted according to the extent to which they correspond to independent observations.

Arthur d'Azemar worked on decentralized recommender systems in collaboration with the WIDE team at Inria Rennes (François Taïani). Arthur applied metric learning techniques to learn a k-NN graph for personalized and adaptive user-based recommendations.

Antoine Capriski worked on the analysis of word semantic change in political texts in collaboration with Caroline Le Pennec (UC Berkeley). He used word embedding techniques to analyze a corpus of political manifestos from the French general elections over the period 1958-1993.

Most work on machine learning and privacy assumes that learners are honest-but-curious. Alexandre Huat worked on making protocols for private machine learning more robust against malicious attacks.

Fabio Vitale is on leave at Department of Computer Science of Sapienza University (Rome, Italy) in the Algorithms Randomization Computation group with Prof. Alessandro Panconesi and Prof. Flavio Chierichetti. His current work on machine learning in graphs follows three directions:

designing new online reciprocal recommenders and analyzing their performance both in theory and in practice,

clustering a finite set of items from pairwise similarity information in different learning settings,

introducing a new online learning framework encompassing several problems where the environment changes over time, and an efficient and very scalable unifying approach to solve the related general learning problem.

Ongoing research also includes the following topics: low-stretch spanning trees, active learning in correlation clustering problems, and hierarchical clustering.

Aurélien Bellet visited the Alan Turing Institute (London) and Amazon Research Cambridge for 1 week in February 2018. He worked with Adrià Gascón and Borja Balle on privacy-preserving machine learning.

Aurélien Bellet was a member of the organization committee of the PPML workshop at NeurIPS'18.

Aurélien Bellet co-organized the kick-off workshop of the associated team PAD-ML with the Alan Turing Institute.

Aurélien Bellet served as PC member for AISTATS'19, ICML'18, NIPS'18, IJCAI'18 Sister Conference, PiMLAI workshop at ICML'18, and CAP'18.

Pascal Denis served as PC member for ACL'18, CONLL'18, EMNLP'18, NAACL'18, NIPS'18, IJCAI-ECAI'18 (Senior PC), CRAC Workshop at NAACL'18.

Marc Tommasi served as PC member for AAAI'18, ICML'18, CAP'18, IJCAI'18 (Senior PC chair), AISTATS'18, NIPS'18.

Jan Ramon served as PC member for AAAI'19, AISTATS'19, IEEE-BigData'18, CIKM'18, DS'18, ECML/PKDD'18, EKAW'18, IEEE-ICDM'18, ICML'18, ILP'18, LOD'18, MLG'18, NIPS'18, SDM'18, TDLSG'18.

Mikaela Keller served as PC member for ICML'18, CAP'18.

Rémi Gilleron served as PC member for NIPS'18, CAP'18, AISTATS'19 and ICLR'19.

Aurélien Bellet was reviewer for Machine Learning Journal and IEEE/ACM Transactions on Networking.

Pascal Denis was reviewer for Computational Linguistics, IJCAI-ECAI Surveys, and Language Resources and Evaluation.

Jan Ramon was member of the editorial boards of Machine Learning Journal (MLJ) and Data Mining and Knowledge Discovery (DMKD). Jan Ramon was reviewer for among others JMLR, TPAMI, JIIS.

Aurélien Bellet gave invited talks at the EPFL-Inria 2018 workshop.

Aurélien Bellet was invited to talk at the seminars of Inria WIDE, Télécom ParisTech, Statistics Seminar of Paris 6/7, CMLA (ENS Paris Saclay) and Naver Labs Europe.

Pascal Denis gave an invited talk at the Séminaire Langage, SCALab, Université de Lille, 26/01/18.

Aurélien Bellet was a member of the jury for the Gilles-Kahn PhD award of the French Society of Computer
Science (SIF), sponsored by the French Academy of Sciences.

Aurélien Bellet acted as external reviewer for the French National Research Agency (ANR), track “Projets de Recherche Collaborative – International”.

Jan Ramon was an external reviewer for the Swiss National Science Foundation (SNF).

Jan Ramon was an external reviewer for the Vienna Science and Technology Fund (WWTF).

Jan Ramon acted as an expert for the H2020 CoE and IMI programs.

Mikaela Keller is member of the Conseil du laboratoire CRIStAL.

Fabien Torre is member of the bureau du Conseil National des Universités (section 27).

Pascal Denis served as a member of the CNRS Pre-GDR NLP Group.

Pascal Denis was elected to Comité National du CNRS, section 34 (Sciences du Langage).

Licence SHS: Joël Legrand, Traitement de textes et tableur, 10h, L1, Université Lille.

Licence SHS: Marc Tommasi, Langages du Web, 24h, L2, Université Lille.

Licence MIASHS: Mikaela Keller, Python 1, 40h, L1, Université Lille.

Licence MIASHS: Marc Tommasi, Codage et représentation de l'information, 48h, L1, Université Lille.

Licence MIASHS: Mikaela Keller, Codage et représentation de l'information, 42h, L1, Université Lille.

Licence SoQ (SHS): Mikaela Keller, Algorithmique de graphes, 24h, L3, Université Lille.

Licence: Marc Tommasi, C2i, 12h, Université Lille.

Licence: Marc Tommasi, Humanités numériques - Découvrir et faire découvrir la programmation, 20h, Université Lille.

Master MIASHS: Mikaela Keller, Algorithmes fondamentaux de la fouille de données, 60h, M1, Université Lille.

Master MIASHS: Joël Legrand, Apprentissage et émergence de comportements, 30h, M2, Université Lille.

Master Data Analysis & Decision Making: Aurélien Bellet, Machine Learning, 12h, Ecole Centrale de Lille.

Master / Master Spécialisé Big Data: Aurélien Bellet, Advanced Machine Learning, 15h, Télécom ParisTech.

Formation continue (Certificat d’Études Spécialisées Data Scientist): Aurélien Bellet, Supervised Learning and Support Vector Machines, 17.5h, Télécom ParisTech.

Master Informatique: Pascal Denis, Fondements de l'Apprentissage Automatique, 46h, M1, Université de Lille.

Postdoc: Melissa Ailem, Inria@SiliconValley postdoctoral grant, supervised by Aurélien Bellet, Marc Tommasi, Pascal Denis and Fei Sha (University of Southern California).

Postdoc: Bo Li, supervised by Pascal Denis on ANR REM, Model Sense Disambiguation, since December 2017.

PhD: Géraud Le Falher, Characterizing edges in signed and vector-valued graphs, defended April 16th, 2018; supervised by Marc Tommasi, Fabio Vitale and Claudio Gentile.

PhD: Ashraf M. Kibriya, Mining Frequent Patterns in Large Networks, defended June 2018; supervised by Jan Ramon.

PhD in progress: Mathieu Dehouck, Graph-based Learning for Multi-lingual and Multi-domain Dependency Parsing, since Oct 2015, Pascal Denis and Marc Tommasi.

PhD in progress: Onkar Pandit, Graph-based Semi-supervised Linguistic Structure Prediction, since Dec. 2017, Pascal Denis, Marc Tommasi and Liva Ralaivola (University of Marseille).

PhD in progress: Mariana Vargas Vieyra, Adaptive Graph Learning with Applications to Natural Language Processing, since Jan. 2018. Pascal Denis and Aurélien Bellet and Marc Tommasi.

PhD in progress: Brij Srivastava, Representation Learning for Privacy-Preserving Speech Recognition, since Oct 2018 Aurélien Bellet and Marc Tommasi and Emmanuel Vincent.

PhD in progress: Mahsa Asadi, On Decentralized Machine Learning, since Oct 2018. Aurélien Bellet and Marc Tommasi.

PhD in progress: Nicolas Crosetti, Privacy Risks of Aggregates in Data Centric-Workflows, since Oct 2018. Florent Capelli and Sophie Tison and Joachim Niehren and Jan Ramon.

PhD in progress: Robin Vogel, Learning to rank by similarity and performance optimization in biometric identification, since 2017 (CIFRE thesis with IDEMIA and Télécom ParisTech). Aurélien Bellet, Stéphan Clémençon and Anne Sabourin.

Aurélien Bellet was member of the PhD jury of Guillaume Papa (Télécom ParisTech), Wenjie Zheng (Sorbonne Université), Michael Blot (Sorbonne Université).

Marc Tommasi was a member of the PhD juries of Gaëtan Hadjeres (*rapporteur*), Alexandre Bérard (*head*), Olivier Ruas (*rapporteur*) and Valentina Zantedeschi.

Pascal Denis was *rapporteur* on the PhD jury of Elena Knyazeva, Université Paris-Saclay.

Mikaela Keller was a member of the recruitment committees for Assistant Professors in Computer Science at Université de Lille and at Université de St-Étienne.

Mikaela Keller was a member of the PhD jury of Damien Fourure (Université de St-Étienne) and of the HDR jury of Renaud Lopes (CHRU Lille).

Rémi Gilleron was head of the PhD jury of Romain Warlop (Université de Lille).

Pascal Denis was a member of the hiring committee for Junior Research Scientists at Inria Lille.

Marc Tommasi was a member of the recruitment committee for Assistant Professors in Computer Science at Université de Lille and for a professor position at INSA de Lyon.

Aurélien Bellet is the scientific mediation contact for the Inria Lille center.

Pascal Denis served as committee member on the Inria Lille Commission Emploi Recherche (CER).

Pascal Denis also served as committee member on Commission de Développement Technologique (CDT).

Pascal Denis is administrator of Inria membership to Linguistic Data Consortium (LDC).

Aurélien Bellet and Marc Tommasi provided expertise for an upcoming TV program on Arte about new technologies.

National events: Jan Ramon and Marc Tommasi participated in a round-table meeting at the *Fête des libertés numériques* for the GDPR day.

In educational institutions: Marc Tommasi gave a talk on privacy and machine learning at the Journées Polytech.