Keywords
 A3. Data and knowledge
 A3.1. Data
 A3.1.1. Modeling, representation
 A3.1.4. Uncertain data
 A3.1.11. Structured data
 A3.3. Data and knowledge analysis
 A3.3.1. Online analytical processing
 A3.3.2. Data mining
 A3.3.3. Big data analysis
 A3.4. Machine learning and statistics
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.3. Reinforcement learning
 A3.4.4. Optimization and learning
 A3.4.5. Bayesian methods
 A3.4.6. Neural networks
 A3.4.8. Deep learning
 A3.5.2. Recommendation systems
 A5.1. Human-Computer Interaction
 A5.10.7. Learning
 A8.6. Information theory
 A8.11. Game Theory
 A9. Artificial intelligence
 A9.2. Machine learning
 A9.3. Signal analysis
 A9.4. Natural language processing
 A9.7. AI algorithmics
 B2. Health
 B3.1. Sustainable development
 B3.5. Agronomy
 B9.5. Sciences
 B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
 Debabrota Basu [Inria, Starting Faculty Position]
 Remy Degenne [Inria, Starting Faculty Position]
 Emilie Kaufmann [CNRS, Researcher, HDR]
 Odalric Maillard [Inria, Researcher, HDR]
 Jill-Jenn Vie [Inria, Researcher, left Scool at the end of Oct. 2021]
Faculty Member
 Philippe Preux [Team leader, Université de Lille, Professor, HDR]
Post-Doctoral Fellows
 Rianne De Heide [Inria, in Scool since Sep 2021]
 Riccardo Della Vecchia [Inria, in Scool since Sep 2021]
 Timothee Mathieu [Inria, in Scool since Oct 2021]
 Sein Minn [Inria]
 Mohit Mittal [Inria]
 Andrea Tirinzoni [Inria, in Scool since Apr 2021]
PhD Students
 Achraf Azize [Université de Lille, in Scool since Oct 2021]
 Dorian Baudry [CNRS]
 Omar Darwiche Domingues [Inria]
 Johan Ferret [Google]
 Yannis Flet-Berliac [Université de Lille, left Scool in Sep 2021]
 Nathan Grinsztajn [École polytechnique]
 Leonard Hussenot-Desenonges [Google]
 Marc Jourdan [Université de Lille, in Scool as an intern since Apr 2021, then PhD student since Oct 2021]
 Matheus Medeiros Centa [Université de Lille, in Scool since Apr 2021 as an intern, then as a PhD student]
 Reda Ouhamma [École polytechnique]
 Sarah Perrin [Université de Lille]
 Fabien Pesquerel [École Normale Supérieure de Paris]
 Clemence Reda [Université de Paris]
 Hassan Saber [Inria, from Sep 2021]
 Patrick Saux [Inria]
 Mathieu Seurin [Université de Lille, left Scool at the end of Aug 2021]
 Xuedong Shang [Université de Lille, left Scool at the end of June 2021]
 Jean Tarbouriech [Facebook]
Technical Staff
 Hernan David Carvajal Bastidas [Inria, Engineer, in Scool since Mar 2021]
 David Emukpere [Inria, Engineer, in Scool since Nov 2021]
 Clemence Leguillette [Inria, Engineer, left Scool at the end of Jan. 2021]
 Antoine MoulinBonno [Inria, Engineer, in Scool from Feb 2021 until Apr 2021]
 Vianney Taquet [Inria, Engineer, left Scool at the end of Jan. 2021]
 Julien Teigny [Inria, Engineer]
Interns and Apprentices
 Thomas Carta [Inria, in Scool from July to Sep.]
 Florent Dufay [École normale supérieure de Rennes, in Scool in June and Jul 2021]
 Toby Johnstone [Université de Lille, in Scool from Apr. to Sep.]
 Marc Jourdan [Inria, from Mar 2021 until Jul 2021]
 Danilo Moreira Lagos [Inria, in Scool from Sep. to Dec.]
Administrative Assistant
 Amelie Supervielle [Inria]
Visiting Scientist
 Bishwamittra Ghosh [National University of Singapore, in Scool since Nov 2021]
2 Overall objectives
Scool is a machine learning (ML) research group. Its research focuses on the problem of sequential decision making under uncertainty (SDMUP). In particular, we consider bandit problems and the reinforcement learning (RL) problem. Put simply, RL is the problem of learning an optimal policy in a Markov Decision Problem (MDP); when the set of states collapses to a single state, this becomes the bandit problem, which focuses on the exploration/exploitation trade-off.
Bandit and RL problems are interesting to study on their own. Both types of problems share a number of fundamental issues (convergence analysis, sample complexity, representation, safety, etc.), and both have real applications, which are different though closely related. The two are also intimately connected: while solving an RL problem, one faces an exploration/exploitation problem and has to solve a bandit problem in each state.
In our work, we also consider settings going beyond the Markovian assumption, in particular non-stationary settings, which represent a challenge common to bandits and RL. We also consider online learning, where the goal is to learn a model from a stream of data, such as learning a compressed representation of a stream of data (each data item may be a scalar, a vector, or even a more complex data structure such as a tree or a graph). A distinctive aspect of the SDMUP with regard to the rest of the field of ML is that the learning problem takes place within a closed-loop interaction between a learning agent and its environment. This feedback loop makes our field of research very different from the two other subfields of ML, supervised and unsupervised learning, even when they are defined in an incremental setting. Hence, SDMUP combines ML with control: the learner is not passive; it acts on its environment and learns from the consequences of these interactions, so it can act in order to obtain information from the environment.
We wish to keep studying applied questions and developing theory, in order to come up with sound approaches to the practical resolution of SDMUP tasks and to guide that resolution. Non-stationary environments are a particularly interesting setting; we are studying this setting and developing new tools to approach it in a sound way, in order to have algorithms that detect environment changes as fast and as reliably as possible, adapt to them, and come with proven guarantees on their performance, measured with the regret for instance. We mostly consider non-parametric statistical models, that is, models in which the number of parameters is not fixed (a parameter may be of any type: a scalar, a vector, a function, etc.), so that the model can adapt during learning and to its changing environment; this also lets the algorithm learn a representation that fits its environment.
3 Research program
Our research mostly deals with bandit problems and reinforcement learning problems. We investigate each thread separately and also in combination, since the management of the exploration/exploitation trade-off is a major issue in reinforcement learning.
On bandit problems, we focus on:
 structured bandits
 bandits for planning (in particular for Monte Carlo Tree Search (MCTS))
 non-stationary bandits
Regarding reinforcement learning, we focus on:
 modeling issues, and dealing with the discrepancy between the model and the task to solve
 learning and using the structure of a Markov decision problem, and of the learned policy
 generalization in reinforcement learning
 reinforcement learning in non-stationary environments
Beyond these objectives, we put a particular emphasis on the study of non-stationary environments. Another area of great concern is the combination of symbolic methods with numerical methods, be it to provide knowledge to the learning algorithm to improve its learning curve, or to better understand what the algorithm has learned and explain its behavior, or to rely on causality rather than on mere correlation.
We also put a particular emphasis on real applications and how to deal with their constraints: lack of a simulator, difficulty of obtaining a realistic model of the problem, small amount of data, dealing with risks, availability of expert knowledge on the task.
4 Application domains
Scool has 2 main topics of application:
 health
 sustainable development
In each of these two domains, we put forward the investigation and the application of the idea of sequential decision making under uncertainty. Though supervised and unsupervised learning have already been studied and applied extensively, sequential decision making remains far less studied; bandits have already been used in many e-commerce applications (e.g., for computational advertising and recommendation systems). However, in applications where human beings may be severely impacted, bandits and reinforcement learning have not been studied much; moreover, these applications come with a scarcity of data and the unavailability of a simulator, which prevents the use of heavy computational simulations to come up with safe automatic decision making.
In 2021, in health, we investigate patient follow-up with Prof. F. Pattou's research group (CHU Lille, INSERM, Université de Lille) in project B4H. This effort comes along with investigating how we may use medical data available locally at CHU Lille, and also the national social security data. We also investigate drug repurposing with Prof. A. Delahaye-Duriez (Inserm, Université de Paris) in project Repos. We also study catheter control by way of reinforcement learning with Inria Lille group Defrost, and company Robocath (Rouen).
Regarding sustainable development, we have a set of projects and collaborations regarding agriculture and gardening. With Cirad and CGIAR, we investigate how one may recommend agricultural practices to farmers in developing countries. Through an associate team with Bihar Agriculture University (India), we investigate data collection. Inria exploratory action SR4SG concerns recommender systems at the level of individual gardens.
There are two important aspects that these two application fields have in common. First, we consider that data collection is an active task: we do not passively observe and record data; we design methods and algorithms to search for useful data. This idea is exploited in most of our application-oriented work. Second, many of these projects include a careful management of risks for human beings. We have to make decisions while taking care of their consequences on human beings, on ecosystems, and on life more generally.
5 Social and environmental responsibility
Sustainable development is a major field of research and application of Scool. We investigate what machine learning can bring to sustainable development, identifying challenges and obstacles, and studying how to overcome them.
Let us mention here:
 sustainable agriculture in developing countries,
 sustainable gardening.
More details can be found in section 4.
6 Highlights of the year
6.1 Awards
 É. Leurent was awarded the Ph.D. prize by CNRS (GdR MACS) and club EEA, which recognizes the best research work on Modelling, Analysis and Control of Dynamical Systems, for his dissertation defended in October 2020. Award tied with Eva Petitdemange.
 É. Leurent was awarded the Ph.D. prize by the Sanef-Abertis Chair, which recognizes the best research work on Road Safety, for his dissertation defended in October 2020.
7 New software and platforms
7.1 New software
7.1.1 rlberry

Keywords:
Reinforcement learning, Simulation, Artificial intelligence

Functional Description:
rlberry is a reinforcement learning (RL) library in Python for research and education. The library provides implementations of several RL agents for you to use as a starting point or as baselines, provides a set of benchmark environments, very useful to debug and challenge your algorithms, handles all random seeds for you, ensuring reproducibility of your results, and is fully compatible with several commonly used RL libraries like OpenAI gym and Stable Baselines.
 URL:

Contact:
Omar Darwiche Domingues
7.1.2 justicia

Name:
Justicia: A Stochastic SAT Approach to Formally Verify Fairness

Keywords:
Fairness, Machine learning, Verification, Fairness Verification, Fair and ethical machine learning, Formal methods

Functional Description:
justicia is a fairness verifier written in Python. The library provides a stochastic SAT encoding of multiple fairness definitions and fair ML algorithms. justicia then further verifies the fairness metric achieved by the corresponding ML algorithm. It is now available as an official Python package and can be installed using pip.

News of the Year:
2020
 URL:

Contact:
Debabrota Basu

Participant:
Bishwamittra Ghosh

Partner:
National University of Singapore
7.1.3 gymDSSAT

Keywords:
Reinforcement learning, Crop management, Sequential decision making under uncertainty, Mechanistic modeling

Functional Description:
gymDSSAT lets you (learn to) manage a crop parcel, from seed selection, to daily activity in the field, to harvesting.
 URL:

Contact:
Romain Gautron

Partners:
CIRAD, CGIAR
8 New results
We organize our research results in a set of categories. The main categories are: bandit problems, reinforcement learning problems, and applications.
8.1 Bandit problems
Sample complexity bounds for stochastic shortest path with a generative model, 50
We consider the objective of computing an $\epsilon$-optimal policy in a stochastic shortest path (SSP) setting, provided that we can access a generative sampling oracle. We propose two algorithms for this setting and derive PAC bounds on their sample complexity: one for the case of positive costs and the other for the case of non-negative costs under a restricted optimality criterion. While tight sample complexity bounds have been derived for finite-horizon and discounted MDPs, the SSP problem is a strict generalization of these settings and it poses additional technical challenges due to the fact that no specific time horizon is prescribed and policies may never terminate, i.e., we are possibly facing non-proper policies. As a consequence, we can neither directly apply existing techniques minimizing sample complexity nor rely on a regret-to-PAC conversion leveraging recent regret bounds for SSP. Our analysis instead combines SSP-specific tools and variance reduction techniques to obtain the first sample complexity bounds for this setting.
Routine Bandits: Minimizing Regret on Recurring Problems, 47
We study a variant of the multi-armed bandit problem in which a learner faces every day one of $B$ many bandit instances, and call it a routine bandit. More specifically, at each period $h \in [1, H]$, the same bandit $b^{h}$ is considered during $T > 1$ consecutive time steps, but the identity $b^{h}$ is unknown to the learner. We assume all reward distributions are Gaussian with unit variance. Such a situation typically occurs in recommender systems when a learner may repeatedly serve the same user whose identity is unknown due to privacy issues. By combining bandit-identification tests with a KLUCB-type strategy, we introduce the KLUCB for Routine Bandits (KLUCB-RB) algorithm. While independently running the KLUCB algorithm at each period leads to a cumulative expected regret of $\Omega(H \log T)$ after $H$ many periods when $T \to \infty$, KLUCB-RB benefits from previous periods by aggregating observations from similar identified bandits, which yields a non-trivial scaling of $\Omega(\log T)$. This is achieved without knowing which bandit instance is being faced by KLUCB-RB in the current period, and without knowing a priori the number of possible bandit instances. We provide numerical illustrations that confirm the benefit of KLUCB-RB, which uses less information about the problem than existing strategies for similar problems.
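As an illustration of the kind of index a KLUCB-type strategy relies on, here is a minimal sketch for Gaussian rewards with unit variance, as assumed in the abstract above. The exploration level `log(t)` and the function names are assumptions made for this example; the identification tests and the aggregation across periods that define KLUCB-RB are not shown.

```python
import math


def gaussian_kl(mu_a: float, mu_b: float) -> float:
    """KL divergence between N(mu_a, 1) and N(mu_b, 1)."""
    return 0.5 * (mu_a - mu_b) ** 2


def kl_ucb_index(mean: float, n_pulls: int, t: int) -> float:
    """Largest q such that n_pulls * KL(mean, q) <= log(t).

    For unit-variance Gaussians the maximization has a closed form.
    The exploration level log(t) is an assumption of this sketch.
    """
    if n_pulls == 0:
        return math.inf  # unexplored arms are tried first
    return mean + math.sqrt(2.0 * math.log(t) / n_pulls)


def select_arm(means, pulls, t):
    """Pick the arm with the largest index at time t."""
    indices = [kl_ucb_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

The closed form follows from the Gaussian divergence being quadratic; for other exponential families the index would typically be computed by a numerical search.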
Non-Asymptotic Sequential Tests for Overlapping Hypotheses and application to near optimal arm identification in bandit models, 12
In this paper, we study sequential testing problems with overlapping hypotheses. We first focus on the simple problem of assessing if the mean $\mu$ of a Gaussian distribution is $\ge -\epsilon$ or $\le \epsilon$; if $\mu \in (-\epsilon, \epsilon)$, both answers are considered to be correct. Then, we consider PAC-best arm identification in a bandit model: given $K$ probability distributions on $\mathbb{R}$ with means $\mu_{1}, \dots, \mu_{K}$, we derive the asymptotic complexity of identifying, with risk at most $\delta$, an index $I \in \{1, \dots, K\}$ such that $\mu_{I} \ge \max_{i} \mu_{i} - \epsilon$. We provide non-asymptotic bounds on the error of a parallel Generalized Likelihood Ratio Test, which can also be used for more general testing problems. We further propose lower bounds on the number of observations needed to identify a correct hypothesis. Those lower bounds rely on information-theoretic arguments, and specifically on two versions of a change of measure lemma (a high-level form, and a low-level form) whose relative merits are discussed.
From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits, 19
The stochastic multi-arm bandit problem has been extensively studied under standard assumptions on the arm's distribution (e.g., bounded with known support, exponential family, etc.). These assumptions are suitable for many real-world problems but sometimes they require knowledge (on tails for instance) that may not be precisely accessible to the practitioner, raising the question of the robustness of bandit algorithms to model misspecification. In this paper we study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations and a data-dependent exploration bonus. We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition. We also show that a simple tuning achieves robustness with respect to a large class of unbounded distributions, at the cost of slightly worse than logarithmic asymptotic regret. We finally provide numerical experiments showing the merits of DS in a decision-making problem on synthetic agriculture data.
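The re-sampling step mentioned above can be illustrated with a small sketch: an arm's observations are re-weighted with a random vector drawn from a uniform Dirichlet distribution. This shows only that ingredient, under the assumption of uniform Dirichlet parameters; the pairwise comparisons and the data-dependent exploration bonus of the DS algorithm are not reproduced, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def dirichlet_weights(n: int, rng: random.Random) -> List[float]:
    """Draw weights from Dirichlet(1, ..., 1) by normalizing Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def resampled_mean(observations: Sequence[float], rng: random.Random) -> float:
    """Randomly re-weighted mean of the observations of one arm."""
    weights = dirichlet_weights(len(observations), rng)
    return sum(w * x for w, x in zip(weights, observations))


if __name__ == "__main__":
    rng = random.Random(0)
    history = [0.2, 0.7, 0.4, 0.9]
    print(resampled_mean(history, rng))
```

Each call returns a random convex combination of the observations, so the re-weighted mean always stays within the range of the observed rewards.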
Top-m identification for linear bandits, 43
Motivated by an application to drug repurposing, we propose the first algorithms to tackle the identification of the $m \ge 1$ arms with largest means in a linear bandit model, in the fixed-confidence setting. These algorithms belong to the generic family of Gap-Index Focused Algorithms (GIFA) that we introduce for Top-m identification in linear bandits. We propose a unified analysis of these algorithms, which shows how the use of features might decrease the sample complexity. We further validate these algorithms empirically on simulated data and on a simple drug repurposing task.
Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification, 44
We study the problem of the identification of $m$ arms with largest means under a fixed error rate $\delta$ (fixed-confidence Top-m identification), for misspecified linear bandit models. This problem is motivated by practical applications, especially in medicine and recommendation systems, where linear models are popular due to their simplicity and the existence of efficient algorithms, but in which data inevitably deviates from linearity. In this work, we first derive a tractable lower bound on the sample complexity of any $\delta$-correct algorithm for the general Top-m identification problem. We show that knowing the scale of the deviation from linearity is necessary to exploit the structure of the problem. We then describe the first algorithm for this setting, which is both practical and adapts to the amount of misspecification. We derive an upper bound on its sample complexity which confirms this adaptivity and matches the lower bound when $\delta \to 0$. Finally, we evaluate our algorithm on both synthetic and real-world data, showing competitive performance with respect to existing baselines.
On Limited-Memory Subsampling Strategies for Bandits, 18
There has been a recent surge of interest in non-parametric bandit algorithms based on subsampling. One drawback of these approaches, however, is the additional complexity required by random subsampling and the storage of the full history of rewards. Our first contribution is to show that a simple deterministic subsampling rule, proposed in the recent work of Baudry et al. (2020) under the name of "last-block subsampling", is asymptotically optimal in one-parameter exponential families. In addition, we prove that these guarantees also hold when limiting the algorithm memory to a polylogarithmic function of the time horizon. These findings open up new perspectives, in particular for non-stationary scenarios in which the arm distributions evolve over time. We propose a variant of the algorithm in which only the most recent observations are used for subsampling, achieving optimal regret guarantees under the assumption of a known number of abrupt changes. Extensive numerical simulations highlight the merits of this approach, particularly when the changes are not only affecting the means of the rewards.
Optimal Thompson Sampling strategies for support-aware CVaR bandits, 17
In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level $\alpha$ of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that makes it possible to bound the CVaR regret of these strategies. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use case in agriculture and on various synthetic examples.
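To make the CVaR criterion concrete, here is a minimal sketch that computes the empirical CVaR of a list of rewards. It assumes the reward convention in which CVaR at level alpha is the supremum over x of x - E[(x - X)^+] / alpha, so that it averages the worst alpha-fraction of outcomes and alpha = 1 gives the mean. This convention and the function name are assumptions of the example, and the sketch is not part of B-CVTS or M-CVTS.

```python
import math
from typing import Sequence


def empirical_cvar(rewards: Sequence[float], alpha: float) -> float:
    """Empirical CVaR at level alpha, for rewards (higher is better).

    Uses the variational form sup_x { x - mean((x - X)^+) / alpha }.
    The objective is concave and piecewise linear in x, so the supremum
    is attained at one of the observed rewards.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(rewards)
    best = -math.inf
    for x in rewards:
        shortfall = sum(max(x - r, 0.0) for r in rewards) / n
        best = max(best, x - shortfall / alpha)
    return best
```

For the rewards 1, 2, 3, 4, the value at alpha = 0.5 is the average of the two lowest rewards (1.5), and at alpha = 1 it is the plain mean (2.5).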
Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits, 39
In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or smaller than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly outperform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.
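The problem setup can be illustrated with the simplest possible strategy: spread the budget uniformly over the arms, then compare each empirical mean with the threshold. This is only a baseline sketch of the interaction protocol, not one of the Frank-Wolfe-inspired algorithms of the paper nor the non-adaptive oracle; the function name and the representation of arms as zero-argument sampling functions are assumptions of the example.

```python
from typing import Callable, List, Sequence


def uniform_thresholding(
    arms: Sequence[Callable[[], float]],
    budget: int,
    threshold: float,
) -> List[bool]:
    """Sample arms in round-robin order, then predict mean > threshold."""
    sums = [0.0] * len(arms)
    counts = [0] * len(arms)
    for step in range(budget):
        k = step % len(arms)  # uniform (non-adaptive) allocation
        sums[k] += arms[k]()
        counts[k] += 1
    # An arm that was never sampled is predicted to be below the threshold.
    return [
        counts[k] > 0 and sums[k] / counts[k] > threshold
        for k in range(len(arms))
    ]
```

An adaptive method would instead use the observations collected so far to decide which arm to sample next, typically spending more of the budget on arms whose means are close to the threshold.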
Stochastic bandits with groups of similar arms, 42
We consider a variant of the stochastic multi-armed bandit problem where arms are known to be organized into different groups having the same mean. The groups are unknown but a lower bound $q$ on their size is known. This situation typically appears when each arm can be described with a list of categorical attributes, and the (unknown) mean reward function only depends on a subset of them, the others being redundant. In this case, $q$ is linked naturally to the number of attributes considered redundant, and the number of categories of each attribute. For this structured problem of practical relevance, we first derive the asymptotic regret lower bound and corresponding constrained optimization problem. They reveal that the achievable regret can be substantially reduced when compared to the unstructured setup, possibly by a factor $q$. However, solving the constrained optimization problem exactly involves a combinatorial problem. We introduce a lower-bound inspired strategy involving a computationally efficient relaxation that is based on a sorting mechanism. We further prove it achieves a lower bound close to the optimal one up to a controlled factor, and achieves an asymptotic regret $q$ times smaller than the unstructured one. We believe this shows it is a valuable strategy for the practitioner. Last, we illustrate the performance of the considered strategy on numerical experiments involving a large number of arms.
Indexed Minimum Empirical Divergence for Unimodal Bandits, 46
We consider a multi-armed bandit problem specified by a set of one-dimensional exponential family distributions endowed with a unimodal structure. We introduce IMED-UB, an algorithm that optimally exploits the unimodal structure, by adapting to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura [2015]. Owing to our proof technique, we are able to provide a concise finite-time analysis of the IMED-UB algorithm. Numerical experiments show that IMED-UB competes with the state-of-the-art algorithms.
Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback, 62
In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through the feedback of an evaluation or simulation oracle. In real life, the feedback of such oracles is often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedback can have multiple fidelities. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay- and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify the regret of PCTS under delayed, noisy, and multi-fidelity feedback. Specifically, we derive regret bounds of PCTS enabled with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon $T$, PCTS retains the regret bound of non-delayed HOO for an expected delay of $O(\log T)$ and worsens by $O\left(T^{\frac{1-\alpha}{d+2}}\right)$ for expected delays of $O\left(T^{1-\alpha}\right)$ for $\alpha \in (0,1]$. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms the state-of-the-art black-box optimization methods for feedback with different noise levels, delays, and fidelity.
8.2 Reinforcement learning
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study, 16
In recent years, on-policy reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [Engstrom'20]. As a step towards filling that gap, we implement >50 such "choices" in a unified on-policy RL framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for on-policy training of RL agents.
Show me the Way: Intrinsic Motivation from Demonstrations, 34
The study of exploration in the domain of decision making has a long history but remains actively debated. From the vast literature that addressed this topic for decades under various points of view (e.g., developmental psychology, experimental design, artificial intelligence), intrinsic motivation emerged as a concept that can practically be transferred to artificial agents. In particular, in the recent field of Deep Reinforcement Learning (RL), agents implement such a concept (mainly using a novelty argument) in the shape of an exploration bonus, added to the task reward, that encourages visiting the whole environment. This approach is supported by the large amount of theory on RL for which convergence to optimality assumes exhaustive exploration. Yet, human beings and mammals do not exhaustively explore the world and their motivation is not only based on novelty but also on various other factors (e.g., curiosity, fun, style, pleasure, safety, competition, etc.). They optimize for life-long learning and train to learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose to learn an exploration bonus from demonstrations that could transfer these motivations to an artificial agent with few assumptions about their rationale. Using an inverse RL approach, we show that complex exploration behaviors, reflecting different motivations, can be learnt and efficiently used by RL agents to solve tasks for which exhaustive exploration is prohibitive.
Primal Wasserstein Imitation Learning, 22
Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample efficient manner in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert with the Wasserstein distance, rather than the commonly used proxy of performance.
Self-Imitation Advantage Learning, 27
Self-imitation learning is a Reinforcement Learning (RL) method that encourages actions whose returns were higher than expected, which helps in hard exploration and sparse reward problems. It was shown to improve the performance of on-policy actor-critic methods in several discrete control tasks. Nevertheless, applying self-imitation to the mostly action-value based off-policy RL methods is not straightforward. We propose SAIL, a novel generalization of self-imitation learning for off-policy RL, based on a modification of the Bellman optimality operator that we connect to Advantage Learning. Crucially, our method mitigates the problem of stale returns by choosing the most optimistic return estimate between the observed return and the current action-value for self-imitation. We demonstrate the empirical effectiveness of SAIL on the Arcade Learning Environment, with a focus on hard exploration games.
Kernel-based reinforcement Learning: A finite-time analysis, 26
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. Unlike existing approaches with regret guarantees, it does not use any kind of partitioning of the state-action space. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left(H^{3} K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and has been previously applied to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
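A non-parametric kernel estimator of the rewards, as mentioned above, can be sketched with a kernel-weighted average of past observations. The example below is a generic smoothing estimator on a one-dimensional space with the metric |x - y|; the Gaussian kernel, the bandwidth and the small regularizer in the denominator are assumptions of this sketch, and the optimistic bonuses and planning steps of Kernel-UCBVI are not shown.

```python
import math
from typing import Sequence


def gaussian_kernel(distance: float, bandwidth: float) -> float:
    """Smoothing kernel: weight decays with the distance to the query."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def kernel_reward_estimate(
    query: float,
    points: Sequence[float],
    rewards: Sequence[float],
    bandwidth: float,
    regularizer: float = 1e-8,
) -> float:
    """Kernel-weighted average of the rewards observed near `query`.

    `regularizer` keeps the ratio well defined when no observation is
    close to the query (the estimate then shrinks towards 0).
    """
    weights = [gaussian_kernel(abs(query - x), bandwidth) for x in points]
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    return weighted_sum / (regularizer + sum(weights))
```

Because nearby observations share information through the kernel, no explicit partition of the space is needed, which is the point made in the abstract above.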
A kernel-based approach to non-stationary reinforcement learning in metric spaces, 25
In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting used to handle changing environments. We further propose a practical implementation of KeRNS, analyze its regret, and validate it experimentally.
Don't Do What Doesn't Matter: Intrinsic Motivation with Action Usefulness, 48
Sparse rewards are double-edged training signals in reinforcement learning: easy to design but hard to optimize. Intrinsic motivation methods have thus been developed to alleviate the resulting exploration problem. They usually incentivize agents to look for new states through novelty signals. Yet, such methods encourage exhaustive exploration of the state space rather than focusing on the environment's salient interaction opportunities. We propose a new exploration method, called Don't Do What Doesn't Matter (DoWhaM), shifting the emphasis from state novelty to states with relevant actions. While most actions consistently change the state when used, e.g., moving the agent, some actions are only effective in specific states, e.g., opening a door, grabbing an object. DoWhaM detects and rewards actions that seldom affect the environment. We evaluate DoWhaM on the procedurally-generated environment MiniGrid, against state-of-the-art methods. Experiments consistently show that DoWhaM greatly reduces sample complexity, setting a new state of the art in MiniGrid.
Learning Value Functions in Deep Policy Gradients using Residual Variance, 29
Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.
READYS: A Reinforcement Learning Based Strategy for Heterogeneous Dynamic Scheduling, 31
In this paper, we propose READYS, a reinforcement learning algorithm for the dynamic scheduling of computations modeled as Directed Acyclic Graphs (DAGs). Our goal is to develop a scheduling algorithm in which allocation and scheduling decisions are made at runtime, based on the state of the system, as performed in runtime systems such as StarPU or ParSEC. Reinforcement Learning is a natural candidate to achieve this task, since its general principle is to build step by step a strategy that, given the state of the system (the state of the resources and a view of the ready tasks and their successors in our case), makes a decision to optimize a global criterion. Moreover, the use of Reinforcement Learning is natural in a context where the duration of tasks (and communications) is stochastic. READYS combines Graph Convolutional Networks (GCN) with an Actor-Critic Algorithm (A2C): it builds an adaptive representation of the scheduling problem on the fly and learns a scheduling strategy, aiming at minimizing the makespan. A crucial point is that READYS builds a general scheduling strategy which is limited neither to one specific application or task graph nor to one particular problem size, and that can be used to schedule any DAG. We focus on different types of task graphs originating from linear algebra factorization kernels (CHOLESKY, LU, QR) and we consider heterogeneous platforms made of a few CPUs and GPUs. We first analyze the performance of READYS when learning is performed on a given (platform, kernel, problem size) combination. Using simulations, we show that the scheduling agent obtains performance very similar or even superior to algorithms from the literature, and that it is especially powerful when the scheduling environment contains a lot of uncertainty. We additionally demonstrate that our agent exhibits very promising generalization capabilities.
To the best of our knowledge, this is the first paper which shows that reinforcement learning can really be used for dynamic DAG scheduling on heterogeneous resources.
There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning, 32
We propose to learn to distinguish reversible from irreversible actions for better informed decision-making in Reinforcement Learning (RL). From theoretical considerations, we show that approximate reversibility can be learned through a simple surrogate task: ranking randomly sampled trajectory events in chronological order. Intuitively, pairs of events that are always observed in the same order are likely to be separated by an irreversible sequence of actions. Conveniently, learning the temporal order of events can be done in a fully self-supervised way, which we use to estimate the reversibility of actions from experience, without any priors. We propose two different strategies that incorporate reversibility in RL agents, one strategy for exploration (RAE) and one strategy for control (RAC). We demonstrate the potential of reversibility-aware agents in several environments, including the challenging Sokoban game. In synthetic tasks, we show that we can learn control policies that never fail and reduce to zero the side-effects of interactions, even without access to the reward function.
Adaptive reward-free exploration, 35
Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994 [11], originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs $O\left((SAH^{4}/\epsilon^{2})\log(1/\delta)\right)$ episodes to output, with probability $1-\delta$, an $\epsilon$-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies using a generative model.
Fast active learning for pure exploration in reinforcement learning, 36
Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort to exploring efficiently. Exploration remains a challenge, although it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus that is added to the empirical estimates of the reward, where $n$ is the number of times this particular state (or state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.
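The difference between the two bonus shapes discussed above can be sketched with a simple count-based bonus. The class name `CountBonus` and its `scale` and `power` parameters are hypothetical and only serve the illustration; the constants and horizon-dependent factors of the actual analysis are omitted.

```python
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus of the form scale / n**power.

    power=0.5 gives the common 1/sqrt(n) rule of thumb,
    power=1.0 gives a bonus that scales with 1/n.
    """

    def __init__(self, scale=1.0, power=0.5):
        self.scale = scale
        self.power = power
        self.counts = defaultdict(int)

    def update(self, state, action):
        """Record one visit of the state-action pair."""
        self.counts[(state, action)] += 1

    def bonus(self, state, action):
        """Bonus for the pair; unvisited pairs are treated as visited once."""
        n = max(1, self.counts[(state, action)])
        return self.scale / n ** self.power
```

After four visits of the same pair, the first variant still grants a bonus of 0.5 while the second has already decayed to 0.25, which illustrates how much faster a $1/n$ bonus vanishes.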
UCB Momentum Q-learning: Correcting the bias without forgetting, 37
We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning where we add a momentum term and rely on the principle of optimism in face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers while, at the same time, limiting the impact it has on the second-order term of the regret. For UCBMQ, we are able to guarantee a regret of at most $O\left(\sqrt{H^{3}SAT}+H^{4}SA\right)$, where $H$ is the length of an episode, $S$ the number of states, $A$ the number of actions, $T$ the number of episodes, and ignoring terms in $\mathrm{poly}\log(SAHT)$. Notably, UCBMQ is the first algorithm that simultaneously matches the lower bound of $\Omega\left(\sqrt{H^{3}SAT}\right)$ for large enough $T$ and has a second-order term (with respect to the horizon $T$) that scales only linearly with the number of states $S$.
Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited, 24
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of $\Omega\left((H^{3}SA/\epsilon^{2})\log(1/\delta)\right)$ on the sample complexity of an $(\epsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega\left(\sqrt{H^{3}SAT}\right)$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
Adversarially Guided Actor-Critic, 28
Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks.
SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning, 60
In this paper, we consider risk-sensitive sequential decision-making in model-based reinforcement learning (RL). We introduce a novel quantification of risk, namely composite risk, which takes into account both aleatory and epistemic risk during the learning process. Previous works have considered aleatory or epistemic risk individually, or an additive combination of the two. We demonstrate that the additive formulation is a particular case of the composite risk, which underestimates the actual CVaR risk even while learning a mixture of Gaussians. In contrast, the composite risk provides a more accurate estimate. We propose to use a bootstrapping method, SENTINEL-K, for distributional RL. SENTINEL-K uses an ensemble of K learners to estimate the return distribution and additionally uses Follow the Regularized Leader (FTRL) from the bandit literature to provide a better estimate of the risk on the return distribution. Finally, we experimentally verify that SENTINEL-K estimates the return distribution better and, when used with the composite risk estimate, demonstrates better risk-sensitive performance than competing RL algorithms.
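For readers unfamiliar with CVaR, the sketch below computes an empirical CVaR (the mean of the worst fraction of returns) and applies it to samples pooled from several ensemble members. It only illustrates the risk measure: the uniform pooling and the function names `cvar` and `pooled_cvar` are assumptions, and this is neither the composite risk definition nor the FTRL-based weighting of SENTINEL-K.

```python
import math

def cvar(returns, alpha=0.1):
    """Empirical CVaR: mean of the worst alpha-fraction of sampled returns."""
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)
    # Keep at least one sample in the tail.
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)

def pooled_cvar(ensemble_returns, alpha=0.1):
    """CVaR of the returns pooled uniformly over all ensemble members."""
    pooled = [r for member in ensemble_returns for r in member]
    return cvar(pooled, alpha)
```

Pooling the members before taking the tail lets disagreement between members (epistemic spread) and spread within a member (aleatory spread) both widen the tail.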
Demonstrating UDO: A Unified Approach for Optimizing Transaction Code, Physical Design, and System Parameters via Reinforcement Learning, 52
UDO is a versatile tool for offline tuning of database systems for specific workloads. UDO can consider a variety of tuning choices, ranging from picking transaction code variants, through index selections, to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations, creating and evaluating different configurations via actual query executions (instead of relying on simplifying cost models). To cater to different parameter types, UDO distinguishes heavy parameters (which are expensive to change, e.g. physical design parameters) from light parameters. Specifically for optimizing heavy parameters, UDO uses reinforcement learning algorithms that allow delaying the point at which reward feedback becomes available. This gives us the freedom to optimize the point in time and the order in which different configurations are created and evaluated (by benchmarking a workload sample). UDO uses a cost-based planner to minimize configuration switching overheads. For instance, it aims to amortize the creation of expensive data structures by consecutively evaluating configurations using them. We demonstrate UDO on Postgres as well as MySQL and on TPC-H as well as TPC-C, optimizing a variety of light and heavy parameters concurrently.
UDO: Universal Database Optimization using Reinforcement Learning, 53
UDO is a versatile tool for offline tuning of database systems for specific workloads. UDO can consider a variety of tuning choices, ranging from picking transaction code variants, through index selections, to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations, creating and evaluating different configurations via actual query executions (instead of relying on simplifying cost models). To cater to different parameter types, UDO distinguishes heavy parameters (which are expensive to change, e.g. physical design parameters) from light parameters. Specifically for optimizing heavy parameters, UDO uses reinforcement learning algorithms that allow delaying the point at which the reward feedback becomes available. This gives us the freedom to optimize the point in time and the order in which different configurations are created and evaluated (by benchmarking a workload sample). UDO uses a cost-based planner to minimize reconfiguration overheads. For instance, it aims to amortize the creation of expensive data structures by consecutively evaluating configurations using them. We evaluate UDO on Postgres as well as MySQL and on TPC-H as well as TPC-C, optimizing a variety of light and heavy parameters concurrently.
8.3 Applications
Evaluating DAS3H on the EdNet Dataset, 20
The EdNet dataset is a massive English language dataset that poses unique challenges for student performance prediction. In this paper, we describe and comment on the results of our award-winning model DAS3H in the context of knowledge tracing in EdNet.
Deep Learning for Deep Waters: An Expert-in-the-Loop Machine Learning Framework for Marine Sciences, 15
Driven by the unprecedented availability of data, machine learning has become a pervasive and transformative technology across industry and science. Its importance to marine science has been codified as one goal of the UN Ocean Decade. While increasing amounts of, for example, acoustic marine data are collected for research and monitoring purposes, and machine learning methods can achieve automatic processing and analysis of acoustic data, they require large training datasets annotated or labelled by experts. Consequently, addressing the relative scarcity of labelled data is, besides increasing data analysis and processing capacities, one of the main thrust areas. One approach to address label scarcity is the expert-in-the-loop approach which allows analysis of limited and unbalanced data efficiently. Its advantages are demonstrated with our novel deep learning-based expert-in-the-loop framework for automatic detection of turbulent wake signatures in echo sounder data. Using machine learning algorithms, such as the one presented in this study, greatly increases the capacity to analyse large amounts of acoustic data. It would be a first step in realising the full potential of the increasing amount of acoustic data in marine sciences.
On Multi-Armed Bandit Designs for Dose-Finding Trials, 11
We study the problem of finding the optimal dosage in early stage clinical trials through the multi-armed bandit lens. We advocate the use of the Thompson Sampling principle, a flexible algorithm that can accommodate different types of monotonicity assumptions on the toxicity and efficacy of the doses. For the simplest version of Thompson Sampling, based on a uniform prior distribution for each dose, we provide finite-time upper bounds on the number of sub-optimal dose selections, which is unprecedented for dose-finding algorithms. Through a large simulation study, we then show that variants of Thompson Sampling based on more sophisticated prior distributions outperform state-of-the-art dose identification algorithms in different types of dose-finding studies that occur in phase I or phase I/II trials.
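A minimal simulation of the simplest variant mentioned above, Thompson Sampling with an independent uniform prior on the toxicity probability of each dose, could look as follows. The selection rule (allocate the dose whose sampled toxicity is closest to a target level), the function name and the default values are assumptions of this sketch, which ignores monotonicity and efficacy.

```python
import random

def thompson_dose_trial(true_toxicity, target=0.3, n_patients=200, seed=0):
    """Simulate a dose-finding trial with Beta(1, 1) priors on each dose.

    true_toxicity : list of (unknown) toxicity probabilities, one per dose
    Returns the number of patients allocated to each dose.
    """
    rng = random.Random(seed)
    n_doses = len(true_toxicity)
    toxic = [0] * n_doses  # observed toxic outcomes per dose
    safe = [0] * n_doses   # observed non-toxic outcomes per dose
    for _ in range(n_patients):
        # One posterior sample per dose; Beta(1, 1) is the uniform prior.
        sampled = [rng.betavariate(1 + toxic[d], 1 + safe[d])
                   for d in range(n_doses)]
        # Allocate the dose that looks closest to the target toxicity.
        dose = min(range(n_doses), key=lambda d: abs(sampled[d] - target))
        if rng.random() < true_toxicity[dose]:
            toxic[dose] += 1
        else:
            safe[dose] += 1
    return [toxic[d] + safe[d] for d in range(n_doses)]
```

For example, `thompson_dose_trial([0.05, 0.15, 0.30, 0.50])` returns the allocation counts over four hypothetical doses; doses whose toxicity is far from the target should be selected less and less often as evidence accumulates.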
8.4 Other
On Meritocracy in Optimal Set Selection, 58
We consider the problem of selecting a set of individuals from a candidate population in order to maximise utility. When the utility function is defined over sets, this raises the question of how to define meritocracy. We define and analyse an appropriate notion of meritocracy derived from the utility function. We introduce the notion of expected marginal contributions of individuals and analyse its links to the underlying optimisation problem, our notion of meritocracy, and other notions of fairness such as the Shapley value. We also experimentally analyse the effect of different policy structures on the utility and meritocracy in a simulated college admission setting including constraints on statistical parity.
Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge, 40
We consider the problem of online linear regression in the stochastic setting. We derive high probability regret bounds for online ridge regression and the forward algorithm. This enables us to compare online regression algorithms more accurately and eliminate assumptions of bounded observations and predictions. Our study advocates for the use of the forward algorithm in lieu of ridge due to its enhanced bounds and robustness to the regularization parameter. Moreover, we explain how to integrate it in algorithms involving linear function approximation to remove a boundedness assumption without deteriorating theoretical bounds. We showcase this modification in linear bandit settings where it yields improved regret bounds. Last, we provide numerical experiments to illustrate our results and endorse our intuitions.
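The difference between the two predictors compared above is small in code. The sketch below assumes the usual textbook form of the forward algorithm (also known as the Vovk-Azoury-Warmuth forecaster), in which the current feature vector is added to the regularized design matrix before predicting; the function name and the NumPy-based layout are choices made for this illustration.

```python
import numpy as np

def online_predictions(X, y, lam=1.0, forward=False):
    """Sequential predictions of online ridge or of the forward algorithm.

    X : array of shape (n, d), features revealed one row at a time
    y : array of shape (n,), labels revealed after each prediction
    """
    n, d = X.shape
    A = lam * np.eye(d)   # regularized design matrix of past features
    b = np.zeros(d)       # sum of y_s * x_s over past rounds
    preds = np.zeros(n)
    for t in range(n):
        x = X[t]
        # The forward algorithm also counts the current feature x_t,
        # whose label is not known yet; ridge uses past features only.
        A_t = A + np.outer(x, x) if forward else A
        theta = np.linalg.solve(A_t, b)
        preds[t] = x @ theta
        # The label y_t is revealed: update the statistics.
        A = A + np.outer(x, x)
        b = b + y[t] * x
    return preds
```

Including the current feature shrinks the prediction along directions that have rarely been observed, which is one intuition for why the forward algorithm can dispense with a bound on the predictions.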
Interferometric Graph Transform for Community Labeling, 61
We present a new approach for learning unsupervised node representations in community graphs. We significantly extend the Interferometric Graph Transform (IGT) to community labeling: this non-linear operator iteratively extracts features that take advantage of the graph topology through demodulation operations. An unsupervised feature extraction step cascades modulus non-linearity with linear operators that aim at building relevant invariants for community labeling. Via a simplified model, we show that the IGT concentrates around the E-IGT: those two representations are related through some ergodicity properties. Experiments on community labeling tasks show that this unsupervised representation achieves performances at the level of the state of the art on the standard and challenging datasets Cora, Citeseer, Pubmed and WikiCS.
Low-Rank Projections of GCNs Laplacian, 33
In this work, we study the behavior of standard models for community detection under spectral manipulations. Through various ablation experiments, we evaluate the impact of bandpass filtering on the performance of a GCN: we empirically show that most of the necessary and used information for node classification is contained in the low-frequency domain, and thus, contrary to images, high frequencies are less crucial to community detection. In particular, it is sometimes possible to obtain accuracies at a state-of-the-art level with simple classifiers that rely only on a few low frequencies.
Fast sampling from beta-ensembles, 13
We study sampling algorithms for $\beta$-ensembles with time complexity less than cubic in the cardinality of the ensemble. Following Dumitriu & Edelman (2002), we see the ensemble as the eigenvalues of a random tridiagonal matrix, namely a random Jacobi matrix. First, we provide a unifying and elementary treatment of the tridiagonal models associated to the three classical Hermite, Laguerre and Jacobi ensembles. For this purpose, we use simple changes of variables between successive reparametrizations of the coefficients defining the tridiagonal matrix. Second, we derive an approximate sampler for the simulation of $\beta$-ensembles, and illustrate how fast it can be for polynomial potentials. This method combines a Gibbs sampler on Jacobi matrices and the diagonalization of these matrices. In practice, even for large ensembles, only a few Gibbs passes suffice for the marginal distribution of the eigenvalues to fit the expected theoretical distribution. When the conditionals in the Gibbs sampler can be simulated exactly, the same fast empirical convergence is observed for the fluctuations of the largest eigenvalue. Our experimental results support a conjecture by Krishnapur et al. (2016), that the Gibbs chain on Jacobi matrices of size $N$ mixes in $\mathcal{O}(\log(N))$.
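The tridiagonal viewpoint can be illustrated on the classical Hermite case. The sketch below follows the standard Dumitriu-Edelman tridiagonal model (Gaussian diagonal, chi-distributed off-diagonal); it is an exact sampler for that one ensemble, not the approximate Gibbs sampler for general potentials described above. The function name is hypothetical, and the dense eigenvalue routine is used only for brevity.

```python
import numpy as np

def sample_hermite_beta_ensemble(n, beta=2.0, seed=None):
    """Draw n points of the Hermite beta-ensemble via its tridiagonal model."""
    rng = np.random.default_rng(seed)
    # Diagonal entries: independent N(0, 2) variables.
    diag = rng.normal(0.0, np.sqrt(2.0), size=n)
    # Off-diagonal entries: chi variables with beta*(n-1), ..., beta degrees
    # of freedom, obtained as square roots of chi-square draws.
    dofs = beta * np.arange(n - 1, 0, -1)
    off = np.sqrt(rng.chisquare(dofs))
    T = (np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)) / np.sqrt(2.0)
    # Dense symmetric eigensolver for simplicity; a solver that exploits the
    # tridiagonal structure would be needed to stay below cubic cost.
    return np.linalg.eigvalsh(T)
```

Only 2n - 1 random variables are drawn, instead of the order n squared entries of a full random matrix, and `beta` may take any positive value.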
Justicia: A Stochastic SAT Approach to Formally Verify Fairness, 30
As a technology, ML is oblivious to societal good or bad, and thus, the field of fair machine learning has stepped up to propose multiple mathematical definitions, algorithms, and systems to ensure different notions of fairness in ML applications. Given the multitude of propositions, it has become imperative to formally verify the fairness metrics satisfied by different algorithms on different datasets. In this paper, we propose a stochastic satisfiability (SSAT) framework, Justicia, that formally verifies different fairness measures of supervised learning algorithms with respect to the underlying data distribution. We instantiate Justicia on multiple classification and bias mitigation algorithms, and datasets to verify different fairness metrics, such as disparate impact, statistical parity, and equalized odds. Justicia is scalable, accurate, and operates on non-Boolean and compound sensitive attributes unlike existing distribution-based verifiers, such as FairSquare and VeriFair. Being distribution-based by design, Justicia is more robust than the verifiers, such as AIF360, that operate on specific test samples. We also theoretically bound the finite-sample error of the verified fairness measure.
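For reference, two of the fairness metrics named above can be computed on a finite test sample as sketched below. This is the sample-based view that the abstract contrasts with distribution-based verification; it is not how Justicia works, and the function names are hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Fraction of positive (1) predictions among members of `group`."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    if not selected:
        raise ValueError(f"no sample belongs to group {group!r}")
    return sum(selected) / len(selected)

def statistical_parity_difference(predictions, groups, group_a, group_b):
    """Absolute gap between the positive rates of two groups (0 is ideal)."""
    return abs(positive_rate(predictions, groups, group_a)
               - positive_rate(predictions, groups, group_b))

def disparate_impact(predictions, groups, group_a, group_b):
    """Ratio of the smaller to the larger positive rate (1 is ideal)."""
    rate_a = positive_rate(predictions, groups, group_a)
    rate_b = positive_rate(predictions, groups, group_b)
    high = max(rate_a, rate_b)
    # If neither group receives positive predictions, there is no disparity.
    return 1.0 if high == 0 else min(rate_a, rate_b) / high
```

Both quantities depend on which samples happen to be in the test set, which is the sensitivity that a distribution-based verifier aims to avoid.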
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
Participants: Philippe Preux, Léonard Hussenot, Johan Ferret, Jean Tarbouriech.
 2 contracts with Google regarding PhDs of J. Ferret and L. Hussenot (2020–2022), managed by Ph. Preux.
 1 contract with Facebook AI Research regarding PhD of J. Tarbouriech (2020–2022), managed by Ph. Preux.
10 Partnerships and cooperations
Participants: Philippe Preux, Odalric-Ambrym Maillard, Émilie Kaufmann, Debabrota Basu.
10.1 International initiatives
10.1.1 Inria associate team not involved in an IIL or an international program
DC4SCM

Title:
Data Collection for Smart Crop Management

Duration:
2020–2024

Coordinator:
Ph. Preux

Partners:
 Bihar Agricultural University, India

Inria contact:
Ph. Preux

Summary:
In the context of Scool projects related to sustainable agriculture, the goal of this partnership is to collect data to train our learning algorithms.
10.1.2 STIC/MATH/CLIMAT AmSud project
 project STIC AmSud EMISTRAL managed by Inria Chile.
10.1.3 Participation in other International Programs
 I. Trummer, Assistant Professor, Cornell University, USA.
 K. Meel, Assistant Professor, National University of Singapore, Singapore.
 A. Schilep, Associate Professor, University of Gothenburg, Sweden.
10.2 International research visitors
Bishwamittra Ghosh

Status
PhD

Institution of origin:
National University of Singapore

Country:
Singapore

Dates:
Nov 2021 – Feb 2022

Context of the visit:
Bishwamittra Ghosh is a fourth-year PhD student at the Department of Computer Science of the National University of Singapore. He is working at the intersection of machine learning and formal methods with Dr. Kuldeep S. Meel. His PhD research is on interpretable and fair machine learning (ML). He builds AI tools to learn interpretable ML models and to verify the fairness of ML models by relying on MaxSAT (maximum satisfiability), SSAT (stochastic satisfiability), etc.
He is visiting Debabrota Basu at Scool. They have been collaborating on verification and explanation of unfairness in ML algorithms. During his visit, he plans to design a framework that can explain the source of unfairness of an ML model. Intuitively, the framework would assign an unfairness weight to an individual feature (or a subset of features) so that features with a higher weight are considered the main source of unfairness induced by the model. Such weights are also defined as Fairness Influence Functions (FIF) of input features. There has been partial progress in computing FIF by borrowing techniques from explainability methods for ML models, such as those based on Shapley values. In this research visit, Bishwamittra plans to further improve that direction.

Mobility program/type of mobility:
Internship funded by MOBLILLEX scholarship of Université de Lille
10.2.1 Visits to international teams
Research stays abroad
Debabrota Basu

Visited institution:
University of St. Gallen

Country:
Switzerland

Dates:
Nov 4–8, 2021

Context of the visit:
D. Basu visited the University of St. Gallen to work with his collaborator Prof. Christos Dimitrakakis. D. Basu has been collaborating with C. Dimitrakakis and his students on multiple topics of reinforcement learning theory and fairness in sequential decision making. This visit aimed to conclude some of the ongoing projects and to begin new ones.

Mobility program/type of mobility:
Research stay
10.3 European initiatives
10.3.1 Other european programs/initiatives
10.4 National initiatives
Scool is involved in 1 ANR project:
 ANR Bold, headed by V. Perchet (ENS Paris-Saclay, ENSAE), local head: É. Kaufmann, 2019–2023.
Scool is involved in some Inria projects:

Challenge HPC – Big Data, headed by B. Raffin, Datamove, Grenoble.
In this challenge, we collaborate with:
 B. Raffin, on what HPC can bring and can be used at its best for reinforcement learning.
 O. Beaumont, E. Jeannot, on what RL can bring to HPC, in particular the use of RL for task scheduling.

In this challenge, we collaborate with L. Gallaraga, CR Inria Rennes, about the combination of statistical and symbolic approaches in machine learning.
 Exploratory action “Sequential Recommendation for Sustainable Gardening (SR4SG)”, headed by OA. Maillard.
Other collaborations in France:
 R. Gautron, PhD student, Cirad, agricultural practices recommendation.
 L. Soulier, Associate Professor, Sorbonne Université, reinforcement learning for information retrieval.
 É. Oyallon, CR CNRS, Sorbonne Université, machine learning on graphs.
 M. Valko, researcher DeepMind.
 A. Delahaye-Duriez, INSERM, Université de Paris.
 B. De Saporta, Université de Montpellier, piecewise-deterministic Markov processes.
 A. Garivier, Professor, ENS Lyon
 V. Perchet, Professor, ENSAE & Criteo AI Lab
 P. Gaillard, CR, Inria Grenoble - Rhône-Alpes
 R. Rouvoy, Professor, Université de Lille, Inria Lille Nord Europe (Équipe Spirals)
 A. Bellet, CR, Inria Lille Nord Europe (Équipe Magnet)
10.5 Regional initiatives
 O.A. Maillard and Ph. Preux co-chair an AI chair, funded by MEL and Inria.
 Ph. Preux leads a collaboration with Prof. F. Pattou's service at CHU Lille/INSERM/Université de Lille regarding post-surgery patient follow-up. On the Inria side, this collaboration involves an engineer (J. Teigny) and a Ph.D. student (P. Saux). This collaboration is currently funded by a set of three regional projects.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: selection
Member of the conference program committees
 Ph. Preux is in the PC of AAAI, ECML.
 E. Kaufmann is in the PC of ALT.
 OA. Maillard is in the PC of NeurIPS (Area Chair)
 D. Basu is in the PC of AAAI, IJCAI, PoPETS.
Reviewer
 OA. Maillard is reviewer at ICML, COLT and “emergency reviewer” at ICML.
 R. Degenne is reviewer at COLT, ALT and AISTATS.
 D. Basu is reviewer for ICML, NeurIPS, AISTATS, and ICLR.
11.1.2 Journal
Member of the editorial boards
 OA. Maillard is in the editorial board of JMLR.
Reviewer - reviewing activities
 OA. Maillard is reviewer for the journal Entropy, and for the Journal of the Royal Statistical Society.
 R. Degenne is reviewer for JMLR and the Journal on Uncertainty Quantification.
 D. Basu is reviewer for IEEE Access, IEEE Transactions on Information Forensics & Security (TIFS), IEEE Transactions on Dependable & Secure Computing (TDSC), Journal of American Statistical Association (JASA).
11.1.3 Invited talks
 Ph. Preux gives a talk entitled “Sequential decision making under uncertainty” at Digital Roads, Chile (virtual)
 E. Kaufmann gives a talk entitled “On pure exploration in (episodic) Markov Decision Processes” at the ICML workshop on Reinforcement Learning Theory (virtual).
 E. Kaufmann gives a talk entitled “Non-Parametric Exploration in Multi-Armed Bandits” at the Mathematical Statistics and Learning Workshop in Banff International Research Station (virtual).
 OA. Maillard gives a talk entitled “Reinforcement Learning & Bandits for Agroecology” at the JFPDA (within PFIA, virtual).
 OA. Maillard gives a talk entitled “A tour of Reinforcement Learning” at the Spiral team seminar, Inria Lille.
 OA. Maillard gives a talk entitled “Some recent results in Reinforcement Learning theory” at the IMAG probability and statistics seminar, Université de Montpellier.
 D. Basu gives a talk entitled “SENTINEL: Quantifying Composite Uncertainty and Its Application in Risk-sensitive Reinforcement Learning” at the Learning Machines Seminars, Research Institute of Sweden (RISE) (virtual).
 D. Basu gives a series of talks entitled “Privacy Preserving Machine Learning” at the Indian Statistical Institute (ISI) Kolkata (virtual).
11.1.4 Scientific expertise
 Ph. Preux is:
 a member of the IRD CSS 5 (data science and models),
 a member of the Commission d'Évaluation (CE) of Inria,
 a member of the Inria DR 2 competition,
 a member of the CRCN competition at Inria Rennes,
 a member of the airlab jury at Université de Lille.
11.1.5 Research administration
 Ph. Preux is deputy scientific delegate at Inria Lille.
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
 D. Basu: “Research Reading Group”, M2 Data Science, Centrale Lille in 2021.
 D. Basu: “Anonymity and Privacy”, M2 Artificial Intelligence and Data Science, École Normale Supérieure (ENS), PSL University in 2021.
 D. Baudry taught about 64 h. at Université de Lille in 2021, in maths (L2 MIAH).
 O. Darwiche taught reinforcement learning at École Centrale de Lille (3 practical sessions) and is grading reinforcement learning homework for the MVA Master at ENS Cachan.
 E. Kaufmann: Reinforcement Learning (20h), École Centrale de Lille.
 OA. Maillard: Statistical Reinforcement Learning (42h), MAP/INF641, Master Artificial Intelligence and advanced Visual Computing, École Polytechnique.
 OA. Maillard: Reinforcement Learning (20h), Master 2 Artificial Intelligence, École CentraleSupélec.
 F. Pesquerel: TA for Ph. Preux “prise de décision séquentielle dans l'incertain”, M2 in Computer Science, Université de Lille.
 F. Pesquerel: TA for OA. Maillard “Reinforcement Learning”, M2 in Computer Science, École CentraleSupélec.
 F. Pesquerel: TA for OA. Maillard “Statistical Reinforcement Learning”, MAP/INF641, Master Artificial Intelligence and advanced Visual Computing, École Polytechnique.
 P. Saux: TA for OA. Maillard “Reinforcement Learning”, M2 in Computer Science, École CentraleSupélec.
 Ph. Preux: « IA et apprentissage automatique », DU IA & Santé, Université de Lille.
 Ph. Preux: “prise de décision séquentielle dans l'incertain”, M2 in Computer Science, Université de Lille.
 R. Degenne: “Sequential learning”, M2 MVA, ENS Paris-Saclay.
11.2.2 Supervision
 Ph. Preux supervised the internship of:
 Matheus Medeiros Centa, M2 Maths, Sorbonne Université, Paris,
 Danilo Moreira Lagos, L3 Computer Science, Chile,
 Toby Johnstone, MVA, ENS Paris-Saclay, co-supervised with N. Grinsztajn.
 OA. Maillard supervised the internship of:
 Thomas Carta, M2 Maths, École Polytechnique, Saclay.
 R. Degenne supervised the internship of:
 Marc Jourdan, Master, ETH Zurich
 D. Basu supervised the internship of:
 Pratik Karmakar, Master, RKMVERI, India
11.2.3 Juries
 Ph. Preux was a member of the juries of:
 Ph.D. in medicine of Clémence Rozwag, Université de Lille
 Ph.D. in CS of Mathieu Seurin, Université de Lille
 Ph.D. in CS of Yannis Flet-Berliac, Université de Lille
 Ph.D. in CS of Jérôme Buisine, Université du Littoral Côte d'Opale
 E. Kaufmann was a member of the juries of:
 Ph.D. in CS of Louis Faury (Telecom ParisTech) (reviewer)
 Ph.D. in CS of Xuedong Shang, Université de Lille (PhD advisor)
 Ph.D. in maths of Zixin Zong, National University of Singapore (reviewer)
 Ph.D. in CS of Mathieu Chambefort, Mines ParisTech
 OA. Maillard was a member of the juries of:
 Ph.D. in CS of Réda Alami, CIFRE, Université Paris-Saclay (PhD co-advisor)
 Ph.D. in CS of Thibaut Cuvelier, CIFRE, CentraleSupélec (reviewer)
 Ph.D. CST in CS of Houssam Zenati, CIFRE, Univ. Grenoble (reviewer)
 INRAE competition for research scientists (concours de chargées de recherche de classe normale), MathNum department
11.3 Popularization
11.3.1 Interventions
12 Scientific production
12.1 Major publications
 1 inproceedings: Spectral Learning from a Single Trajectory under Finite-State Policies. Proceedings of the International Conference on Machine Learning, Sydney, Australia, July 2017.
 2 inproceedings: Multi-Player Bandits Revisited. Algorithmic Learning Theory, Mehryar Mohri and Karthik Sridharan (eds.), Lanzarote, Spain, April 2018.
 3 inproceedings: Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL. IJCAI 2020 - International Joint Conference on Artificial Intelligence, Yokohama, Japan, July 2020.
 4 inproceedings: Optimal Best Arm Identification with Fixed Confidence. 29th Annual Conference on Learning Theory (COLT), vol. 49, JMLR Workshop and Conference Proceedings, New York, United States, June 2016.
 5 article: Operator-valued Kernels for Learning from Functional Response Data. Journal of Machine Learning Research, 17(20), 2016, 1-54.
 6 inproceedings: Monte-Carlo Tree Search by Best Arm Identification. NIPS 2017 - 31st Annual Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, Long Beach, United States, December 2017, 1-23.
 7 article: Boundary Crossing Probabilities for General Exponential Families. Mathematical Methods of Statistics, 27, 2018.
 8 inproceedings: Tightening Exploration in Upper Confidence Reinforcement Learning. International Conference on Machine Learning, Vienna, Austria, July 2020.
 9 inproceedings: Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. International Conference on Machine Learning, vol. 32, Journal of Machine Learning Research, Workshop and Conference Proceedings; Proceedings of The 31st International Conference on Machine Learning, Beijing, China, June 2014.
 10 inproceedings: Visual Reasoning with Multi-hop Feature Modulation. ECCV 2018 - 15th European Conference on Computer Vision, vols. 11205-11220, Part of the Lecture Notes in Computer Science book series - LNCS 11209, Munich, Germany, September 2018, 808-831.
12.2 Publications of the year
International journals
 11 article: On Multi-Armed Bandit Designs for Dose-Finding Trials. Journal of Machine Learning Research, January 2021.
 12 article: Non-Asymptotic Sequential Tests for Overlapping Hypotheses and application to near optimal arm identification in bandit models. Sequential Analysis, March 2021.
 13 article: Fast sampling from beta-ensembles. Statistics and Computing, 31(7), January 2021.
 14 article: Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals. Journal of Machine Learning Research, December 2021.
 15 article: Deep Learning for Deep Waters: An Expert-in-the-Loop Machine Learning Framework for Marine Sciences. Journal of Marine Science and Engineering, 9(2), February 2021, 169.
International peer-reviewed conferences
 16 inproceedings: What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. ICLR 2021 - Ninth International Conference on Learning Representations, Vienna / Virtual, Austria, May 2021.
 17 inproceedings: Optimal Thompson Sampling strategies for support-aware CVaR bandits. 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Virtual, United States, July 2021.
 18 inproceedings: On Limited-Memory Subsampling Strategies for Bandits. ICML 2021 - International Conference on Machine Learning, Vienna / Virtual, Austria, July 2021.
 19 inproceedings: From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits. NeurIPS 2021 - 35th International Conference on Neural Information Processing Systems, Sydney, Australia, December 2021.
 20 inproceedings: Evaluating DAS3H on the EdNet Dataset. AAAI 2021 - The 35th Conference on Artificial Intelligence / Imagining Post-COVID Education with AI, Virtual, United States, January 2021.
 21 inproceedings: Reinforcement Learning in Parametric MDPs with Exponential Families. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, vol. 130, Proceedings of Machine Learning Research, San Diego, United States, 2021, 1855-1863.
 22 inproceedings: Primal Wasserstein Imitation Learning. ICLR 2021 - Ninth International Conference on Learning Representations, Vienna / Virtual, Austria, June 2020.
 23 inproceedings: Offline Reinforcement Learning with Pseudometric Learning. ICML 2021 - 38th International Conference on Machine Learning, vol. 139, Virtual, France, June 2021.
 24 inproceedings: Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learning Theory, Paris / Virtual, France, March 2021.
 25 inproceedings: A kernel-based approach to non-stationary reinforcement learning in metric spaces. International Conference on Artificial Intelligence and Statistics, San Diego / Virtual, United States, April 2021.
 26 inproceedings: Kernel-based reinforcement Learning: A finite-time analysis. International Conference on Machine Learning, Vienna / Virtual, Austria, July 2021.
 27 inproceedings: Self-Imitation Advantage Learning. AAMAS 2021 - 20th International Conference on Autonomous Agents and Multiagent Systems, London / Virtual, United Kingdom, May 2021.
 28 inproceedings: Adversarially Guided Actor-Critic. ICLR 2021 - International Conference on Learning Representations, Vienna / Virtual, Austria, May 2021.
 29 inproceedings: Learning Value Functions in Deep Policy Gradients using Residual Variance. ICLR 2021 - International Conference on Learning Representations, Vienna / Virtual, Austria, May 2021.
 30 inproceedings: Justicia: A Stochastic SAT Approach to Formally Verify Fairness. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9), Virtual, Canada, February 2021, 7554-7563.
 31 inproceedings: READYS: A Reinforcement Learning Based Strategy for Heterogeneous Dynamic Scheduling. IEEE Cluster 2021, Portland / Virtual, United States, September 2021.
 32 inproceedings: There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning. Proc. Thirty-fifth Conference on Neural Information Processing Systems (2021), Virtual, France, December 2021.
 33 inproceedings: Low-Rank Projections of GCNs Laplacian. ICLR 2021 Workshop GTRL, Online, France, May 2021.
 34 inproceedings: Show me the Way: Intrinsic Motivation from Demonstrations. AAMAS 2021 - 20th International Conference on Autonomous Agents and Multiagent Systems, Virtual, United Kingdom, May 2021.
 35 inproceedings: Adaptive reward-free exploration. Algorithmic Learning Theory, Paris, France, 2021.
 36 inproceedings: Fast active learning for pure exploration in reinforcement learning. International Conference on Machine Learning, Vienna, Austria, July 2021.
 37 inproceedings: UCB Momentum Q-learning: Correcting the bias without forgetting. International Conference on Machine Learning, Vienna / Virtual, Austria, July 2021.
 38 inproceedings: MetaREVEAL: RL-based Meta-learning from Learning Curves. Workshop on Interactive Adaptive Learning co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), Bilbao / Virtual, Spain, September 2021.
 39 inproceedings: Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits. NeurIPS 2021 - 35th International Conference on Neural Information Processing Systems, Virtual, Canada, 2021, 1-25.
 40 inproceedings: Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge. NeurIPS 2021 - 35th International Conference on Neural Information Processing Systems, Virtual, Canada, December 2021.
 41 inproceedings: Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection. Thirty-Fifth Conference on Neural Information Processing Systems, Virtual, France, December 2021.
 42 inproceedings: Stochastic bandits with groups of similar arms. NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Sydney, Australia, December 2021.
 43 inproceedings: Top-m identification for linear bandits. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 130, Virtual, United States, 2021.
 44 inproceedings: Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification. 35th Conference on Neural Information Processing Systems, Virtual, France, 2021.
 45 inproceedings: Offline Reinforcement Learning as Anti-Exploration. AAAI 2022 - 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada, February 2022.
 46 inproceedings: Indexed Minimum Empirical Divergence for Unimodal Bandits. NeurIPS 2021 - International Conference on Neural Information Processing Systems, Virtual-only Conference, United States, December 2021.
 47 inproceedings: Routine Bandits: Minimizing Regret on Recurring Problems. ECML-PKDD 2021, Bilbao, Spain, September 2021.
 48 inproceedings: Don't Do What Doesn't Matter: Intrinsic Motivation with Action Usefulness. Proc. International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, August 2021, 2950-2956.
 49 inproceedings: A Provably Efficient Sample Collection Strategy for Reinforcement Learning. Neural Information Processing Systems (NeurIPS), Virtual / Sydney, Australia, December 2021.
 50 inproceedings: Sample complexity bounds for stochastic shortest path with a generative model. Algorithmic Learning Theory, Paris, France, 2021.
 51 inproceedings: Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret. Neural Information Processing Systems (NeurIPS), Virtual / Sydney, Australia, December 2021.
 52 inproceedings: Demonstrating UDO: A Unified Approach for Optimizing Transaction Code, Physical Design, and System Parameters via Reinforcement Learning. SIGMOD/PODS '21: International Conference on Management of Data, Proceedings of the 2021 International Conference on Management of Data (SIGMOD ’21), Virtual Event, China, ACM, June 2021, 2794-2797.
 53 inproceedings: UDO: Universal Database Optimization using Reinforcement Learning. Proceedings of the VLDB Endowment, 14(13), Sydney, Australia, VLDB Endowment, September 2021, 3402-3414.
Doctoral dissertations and habilitation theses
 54 thesis: Sample-Efficient Deep Reinforcement Learning for Control, Exploration and Safety. Université de Lille - Faculté des Sciences et Technologies, October 2021.
 55 thesis: Learning to Interact, Interacting to Learn Action-centric Reinforcement Learning. Université de Lille - Faculté des Sciences et Technologies, September 2021.
 56 thesis: Adaptive methods for optimization in stochastic environments. Université de Lille, September 2021.
Reports & preprints
 57 report: An Integer Linear Programming Approach for Pipelined Model Parallelism. RR-9452, Inria, January 2022.
 58 misc: On Meritocracy in Optimal Set Selection. November 2021.
 59 misc: VISUALHINTS: A Visual-Lingual Environment for Multimodal Reinforcement Learning. December 2021.
 60 misc: SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning. February 2021.
 61 misc: Interferometric Graph Transform for Community Labeling. June 2021.
 62 misc: Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback. November 2021.