Keywords
 A3. Data and knowledge
 A3.1. Data
 A3.1.1. Modeling, representation
 A3.1.4. Uncertain data
 A3.1.11. Structured data
 A3.3. Data and knowledge analysis
 A3.3.1. Online analytical processing
 A3.3.2. Data mining
 A3.3.3. Big data analysis
 A3.4. Machine learning and statistics
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.3. Reinforcement learning
 A3.4.4. Optimization and learning
 A3.4.5. Bayesian methods
 A3.4.6. Neural networks
 A3.4.8. Deep learning
 A3.5.2. Recommendation systems
 A5.1. Human-Computer Interaction
 A5.10.7. Learning
 A8.6. Information theory
 A8.11. Game Theory
 A9. Artificial intelligence
 A9.2. Machine learning
 A9.3. Signal analysis
 A9.4. Natural language processing
 A9.7. AI algorithmics
 B2. Health
 B3.1. Sustainable development
 B3.5. Agronomy
 B9.5. Sciences
 B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
 Riadh Akrour [INRIA, Researcher]
 Debabrota Basu [INRIA, Researcher]
 Remy Degenne [INRIA, Researcher]
 Emilie Kaufmann [CNRS, Researcher, HDR]
 Odalric-Ambrym Maillard [INRIA, Researcher, HDR]
Faculty Member
 Philippe Preux [Team leader, UNIV LILLE, Professor, HDR]
Post-Doctoral Fellows
 Tuan Dam Quang Tuan [INRIA, from Oct 2022]
 Riccardo Della Vecchia [INRIA]
 Timothee Mathieu [INRIA]
 Mohit Mittal [INRIA, until Sep 2022]
 Alena Shilova [INRIA]
 Eduardo Vasconcellos [FFU  NITEROI, from Jul 2022]
PhD Students
 Achraf Azize [UNIV LILLE]
 Dorian Baudry [CNRS]
 Romain Gautron [CIRAD and CGIAR]
 Nathan Grinsztajn [LIX]
 Marc Jourdan [UNIV LILLE]
 Hector Kohler [INRIA, from Mar 2022 until Aug 2022]
 Penanklihi Cyrille Kone [INRIA, from Apr 2022 until Aug 2022]
 Matheus Medeiros Centa [UNIV LILLE]
 Reda Ouhamma [UNIV LILLE, until Aug 2022]
 Fabien Pesquerel [ENS PARIS]
 Patrick Saux [INRIA]
 Sumit Vashishtha [UNIV LILLE, from Oct 2022]
 Omar Darwiche [INRIA]
 Johan Ferret [Google]
 Romain Gautron [CIRAD/CGIAR]
 Julien Tarbouriech [Facebook]
Technical Staff
 Hernan David Carvajal Bastidas [INRIA, Engineer]
 Tomy Soumphonphakdy [INRIA, Engineer, from May 2022]
Administrative Assistants
 Aurore Dalle [INRIA]
 Lucille Leclercq [INRIA, until Aug 2022]
 Anne Rejl [INRIA]
 Amélie Supervielle [INRIA]
2 Overall objectives
Scool is a machine learning (ML) research group. Scool's research focuses on the study of the sequential decision making under uncertainty problem (SDMUP). In particular, we consider bandit problems [53] and the reinforcement learning (RL) problem [52]. Put simply, RL considers the problem of learning an optimal policy in a Markov Decision Problem (MDP) [50]; when the set of states collapses to a single state, this is known as the bandit problem, which focuses on the exploration/exploitation problem.
Bandit and RL problems are interesting to study in their own right. Both types of problems share a number of fundamental issues (convergence analysis, sample complexity, representation, safety, etc.), and both have real applications, different though closely related. Moreover, while solving an RL problem, one faces an exploration/exploitation problem and has to solve a bandit problem in each state, which connects the two types of problems very intimately.
In our work, we also consider settings going beyond the Markovian assumption, in particular non-stationary settings, which represent a challenge common to bandits and RL. A distinctive aspect of the SDMUP with regard to the rest of the field of ML is that the learning problem takes place within a closed-loop interaction between a learning agent and its environment. This feedback loop makes our field of research very different from the two other subfields of ML, supervised and unsupervised learning, even when they are defined in an incremental setting. Hence, SDMUP combines ML with control: the learner is not passive; it acts on its environment and learns from the consequences of these interactions, so it can act in order to obtain information from the environment. Naturally, the optimal control community is increasingly interested in RL (see e.g. [51]).
We wish to continue studying applied questions and developing theory, so as to come up with sound approaches to the practical resolution of SDMUP tasks and to guide their resolution. Non-stationary environments are a particularly interesting setting; we are studying this setting and developing new tools to approach it in a sound way, in order to obtain algorithms that detect environment changes as quickly and as reliably as possible, adapt to them, and come with provable guarantees on their performance, measured by the regret for instance. We mostly consider non-parametric statistical models, that is, models in which the number of parameters is not fixed (a parameter may be of any type: a scalar, a vector, a function, etc.), so that the model can adapt during learning and to its changing environment; this also lets the algorithm learn a representation that fits its environment.
3 Research program
Our research mostly deals with bandit problems and reinforcement learning problems. We investigate each thread separately and also in combination, since the management of the exploration/exploitation trade-off is a major issue in reinforcement learning.
On bandit problems, we focus on:
 structured bandits
 bandits for planning (in particular for Monte Carlo Tree Search (MCTS))
 non-stationary bandits
Regarding reinforcement learning, we focus on:
 modeling issues, and dealing with the discrepancy between the model and the task to solve
 learning and using the structure of a Markov decision problem, and of the learned policy
 generalization in reinforcement learning
 reinforcement learning in non-stationary environments
Beyond these objectives, we put a particular emphasis on the study of non-stationary environments. Another area of great concern is the combination of symbolic methods with numerical methods, be it to provide knowledge to the learning algorithm to improve its learning curve, to better understand what the algorithm has learned and explain its behavior, or to rely on causality rather than on mere correlation.
We also put a particular emphasis on real applications and how to deal with their constraints: lack of a simulator, difficulty of obtaining a realistic model of the problem, small amounts of data, dealing with risks, availability of expert knowledge on the task.
4 Application domains
Scool has 2 main topics of application:
 health
 sustainable development
In each of these two domains, we put forward the investigation and the application of the idea of sequential decision making under uncertainty. Though supervised and unsupervised learning have already been studied and applied extensively, sequential decision making remains far less studied; bandits have already been used in many e-commerce applications (e.g. for computational advertising and recommendation systems). However, in applications where human beings may be severely impacted, bandits and reinforcement learning have not been studied much; moreover, these applications come with a scarcity of data and the unavailability of a simulator, which rules out the heavy computational simulations otherwise used to arrive at safe automatic decision making.
In 2022, in health, we investigate patient follow-up with Prof. F. Pattou's research group (CHU Lille, Inserm, Université de Lille) in project B4H. This effort comes along with investigating how we may use medical data available locally at CHU Lille, and also the national social security data. We also investigate drug repurposing with Prof. A. Delahaye-Duriez (Inserm, Université de Paris) in project Repos. We also study catheter control by way of reinforcement learning with Inria Lille group Defrost, and company Robocath (Rouen).
Regarding sustainable development, we have a set of projects and collaborations regarding agriculture and gardening. With Cirad and CGIAR, we investigate how one may recommend agricultural practices to farmers in developing countries. Through an associate team with Bihar Agriculture University (India), we investigate data collection. Inria exploratory action SR4SG concerns recommender systems at the level of individual gardens.
There are two important aspects that are amply shared by these two application fields. First, we consider that data collection is an active task: we do not passively observe and record data; we design methods and algorithms to search for useful data. This idea is exploited in most of these application-oriented works. Second, many of these projects include a careful management of risks for human beings. We have to make decisions while taking into account their consequences on human beings, on ecosystems and on life more generally.
5 Social and environmental responsibility
Sustainable development is a major field of research and application of Scool. We investigate what machine learning can bring to sustainable development, identifying challenges and obstacles, and studying how to overcome them.
Let us mention here:
 sustainable agriculture in developing countries,
 sustainable gardening.
More details can be found in section 4.
6 Highlights of the year
We submitted two ANR JCJC and one ANR PRC projects and all 3 were accepted. They begin in 2023 and will last 4 years.

 BIPUP is an ANR PRC with Inserm U. 1190, headed by Ph. Preux.
 FATE is an ANR JCJC headed by R. Degenne.
 REPUBLIC is an ANR JCJC headed by D. Basu.
6.1 Awards
 D. Basu and collaborators received the Best Paper with Student Presenter Award in ACM EAAMO 2022 for their paper “On Meritocracy in Optimal Set Selection”.
 T. Mathieu received the “prix solennel de thèse” of the Chancellerie des Universités de Paris in 2022.
7 New software and platforms
7.1 New software
7.1.1 rlberry

Keywords:
Reinforcement learning, Simulation, Artificial intelligence

Functional Description:
rlberry is a reinforcement learning (RL) library in Python for research and education. The library provides implementations of several RL agents for you to use as a starting point or as baselines; provides a set of benchmark environments, useful to debug and challenge your algorithms; handles all random seeds for you, ensuring reproducibility of your results; and is fully compatible with several commonly used RL libraries such as OpenAI Gym and Stable Baselines.
 URL:

Contact:
Timothee Mathieu
7.1.2 gymDSSAT

Keywords:
Reinforcement learning, Crop management, Sequential decision making under uncertainty, Mechanistic modeling

Functional Description:
gymDSSAT lets you (learn to) manage a crop parcel, from seed selection, to daily activity in the field, to harvesting.
 URL:

Contact:
Romain Gautron

Partners:
CIRAD, CGIAR
7.1.3 Weight Trajectory Predictor : algorithm

Name:
Weight Trajectory Predictor : algorithm

Keywords:
Medical applications, Machine learning

Scientific Description:
We performed a retrospective study of clinical data collected prospectively on patients with up to five years of postoperative follow-up (ABOS cohort, CHU Lille) and trained a supervised model to predict the relative total weight loss (“%TWL”) of a patient 1, 3, 12, 24 and 60 months after surgery. This model consists of a decision tree, written in Python, which takes as input a selected subset of preoperative attributes (weight, height, type of intervention, age, presence or absence of type 2 diabetes or impaired glucose tolerance, diabetes duration, smoking habits) and returns an estimate of %TWL as well as a prediction interval based on the interquartile range of %TWL observed on similar patients. The predictions of this tool have been validated both internally and externally (on French and Dutch cohorts).

Functional Description:
The “Weight Trajectory Predictor” algorithm is part of a larger project, whose goal is to leverage artificial intelligence techniques to improve patient care. This code is the product of a collaboration between Inria SCOOL and the UMR 1190-EGID team of the CHU Lille. It aims to predict the weight loss trajectory of a patient following bariatric surgery (treatment of severe obesity) from a set of preoperative characteristics.
The goal of this software is to improve patient follow-up after bariatric surgery:
 during preoperative visits, by providing clinicians with a quantitative tool to inform the patient regarding potential weight loss outcomes;
 during postoperative control visits, by comparing the predicted and realized weight trajectories, which may facilitate early detection of complications.
This software component will be embedded in a web app for ease of use.
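The prediction step described above (an estimate plus an interval based on the interquartile range of similar patients) can be illustrated with a minimal sketch. Everything below is hypothetical: the single split, the attribute name `bmi`, the leaf names and all numbers are synthetic placeholders with no clinical meaning, and using the median as the point estimate is an assumption made here. It is not the released model.

```python
import statistics

# Synthetic %TWL values of past patients, grouped by the leaf they fall in.
# Placeholder numbers only: they carry no clinical meaning.
LEAVES = {
    "leaf_a": [18.0, 22.0, 25.0, 27.0, 31.0],
    "leaf_b": [28.0, 30.0, 33.0, 36.0, 41.0],
}


def route(patient):
    """Toy one-split tree on a hypothetical attribute and threshold."""
    return "leaf_a" if patient["bmi"] < 45.0 else "leaf_b"


def predict_twl(patient):
    """Return a point estimate and an interquartile prediction interval."""
    outcomes = LEAVES[route(patient)]
    # Quartiles of the outcomes observed on "similar" patients (same leaf).
    q1, median, q3 = statistics.quantiles(outcomes, n=4, method="inclusive")
    return {"estimate": median, "interval": (q1, q3)}
```

Calling `predict_twl({"bmi": 40.0})` returns the median of the matching leaf together with its first and third quartiles, which mirrors the "estimate plus interquartile interval" output described in the text.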

Release Contributions:
Initial version
 URL:

Contact:
Julien Teigny

Participants:
Pierre Bauvin, Francois Pattou, Philippe Preux, Violeta Raverdy, Patrick Saux, Tomy Soumphonphakdy, Julien Teigny, Hélène Verkindt

Partner:
CHU de Lille
8 New results
We organize our research results in a set of categories. The main categories are: bandit problems, reinforcement learning problems, and applications.
Participants: all Scool members.
8.1 Bandits and RL theory
Efficient Algorithms for Extreme Bandits [17]
In this paper, we contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward. We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tail of the reward distributions. This analysis motivates the introduction of Quantile of Maxima (QoMax). The properties of QoMax are sufficient to build an Explore-Then-Commit (ETC) strategy, QoMax-ETC, achieving strong asymptotic guarantees despite its simplicity. We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al. (2021). Both algorithms are more efficient than existing approaches in two respects: (1) they lead to better empirical performance, and (2) they enjoy a significant reduction of the memory and time complexities.
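As an illustration of the Quantile of Maxima idea, here is a minimal sketch that splits the rewards of one arm into batches, takes the maximum of each batch, and returns an empirical quantile of those maxima. The function name, the batching of consecutive samples and the quantile convention are assumptions made for this sketch; the paper should be consulted for the exact estimator and its parameters.

```python
import math


def qomax(samples, n_batches, q):
    """Empirical quantile of order q of the per-batch maxima of `samples`."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    batch_size = len(samples) // n_batches
    if n_batches < 2 or batch_size < 1:
        raise ValueError("need at least two non-empty batches")
    # Maximum of each consecutive batch; leftover samples are ignored.
    maxima = sorted(
        max(samples[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    )
    # Smallest maximum such that at least a fraction q of the maxima
    # are less than or equal to it.
    index = min(n_batches - 1, max(0, math.ceil(q * n_batches) - 1))
    return maxima[index]
```

An Explore-Then-Commit strategy in this spirit would collect the same number of samples from every arm, compute this statistic per arm, and then commit to the arm with the largest value.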
Optimistic PAC Reinforcement Learning: the Instance-Dependent View [37]
Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.
Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [30]
In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.
Choosing Answers in epsilon-Best-Answer Identification for Linear Bandits [24]
In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\epsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a furthest answer should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\epsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms.
Top Two Algorithms Revisited [23]
Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models [38], for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a leader and a challenger. Despite their good empirical performance, theoretical guarantees for fixed-confidence best arm identification have only been obtained when the arms are Gaussian with known variances. In this paper, we provide a general analysis of Top Two methods, which identifies desirable properties of the leader, the challenger, and the (possibly non-parametric) distributions of the arms. As a result, we obtain theoretically supported Top Two algorithms for best arm identification with bounded distributions. Our proof method demonstrates in particular that the sampling step used to select the leader inherited from Thompson sampling can be replaced by other choices, like selecting the empirical best arm.
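The leader/challenger mechanism can be sketched as follows for Gaussian arms with unit variance. This is only one possible instance, chosen here for simplicity: the leader is the empirical best arm, the challenger minimizes a Gaussian transportation cost, and the run uses a fixed budget rather than a fixed-confidence stopping rule. The function names and the parameter `beta` (probability of sampling the leader) are assumptions of this sketch.

```python
import random


def top_two_step(means, counts, beta, rng):
    """Return the next arm: the leader with probability beta, else the challenger."""
    arms = range(len(means))
    leader = max(arms, key=lambda a: means[a])

    def transport_cost(a):
        # Gaussian (unit variance) cost of confusing arm a with the leader.
        gap = max(means[leader] - means[a], 0.0)
        return gap ** 2 / (2.0 * (1.0 / counts[leader] + 1.0 / counts[a]))

    challenger = min((a for a in arms if a != leader), key=transport_cost)
    return leader if rng.random() < beta else challenger


def run_top_two(true_means, horizon, beta=0.5, seed=0):
    """Run the sketch for `horizon` pulls; return (recommended arm, pull counts)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [1] * k
    sums = [rng.gauss(mu, 1.0) for mu in true_means]  # one initial pull per arm
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        arm = top_two_step(means, counts, beta, rng)
        sums[arm] += rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
    means = [s / n for s, n in zip(sums, counts)]
    return max(range(k), key=lambda a: means[a]), counts
```

Choosing the empirical best arm as leader, instead of a Thompson sample, corresponds to the kind of replacement the abstract says the analysis allows.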
IMED-RL: Regret optimal learning of ergodic Markov decision processes [26]
We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed light-tailed, semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, ergodic and communicating only, showing the competitiveness of IMED-RL in finite-time against state-of-the-art algorithms. IMED-RL also benefits from a light complexity.
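To give an idea of the bandit algorithm that IMED-RL extends, here is a minimal sketch of an IMED-style index for Bernoulli arms: each arm gets the index N_a · KL(mean_a, best mean) + log N_a, and the arm with the smallest index is pulled. This covers only the bandit case, not the MDP algorithm of the paper; function names and the clipping constant `eps` are assumptions of this sketch.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def imed_indices(means, counts):
    """IMED-style index of each arm, computed from empirical means and pull counts."""
    best = max(means)
    return [n * bernoulli_kl(m, best) + math.log(n) for m, n in zip(means, counts)]


def run_imed(true_means, horizon, seed=0):
    """Pull the arm of minimal index at each round; return the pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [1] * k
    sums = [float(rng.random() < mu) for mu in true_means]  # one initial pull per arm
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        indices = imed_indices(means, counts)
        arm = min(range(k), key=lambda a: indices[a])
        sums[arm] += float(rng.random() < true_means[arm])
        counts[arm] += 1
    return counts
```

For the empirically best arm the divergence term vanishes, so its index reduces to log N_a: it keeps being pulled until the logarithm of its count exceeds the index of some other arm.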
Risk-aware linear bandits with convex loss [36]
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximate solutions obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
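The expectile mentioned above can be computed without contexts by solving the asymmetric least-squares problem directly. The sketch below is a minimal illustration for a plain sample (no linear model, no bandit feedback): it iterates the first-order condition, which makes the expectile a weighted mean with weight `tau` on values above it and `1 - tau` on values below. The function name and stopping parameters are assumptions of this sketch.

```python
def expectile(values, tau, tol=1e-12, max_iter=1000):
    """tau-expectile of a sample: minimizer of sum_i |tau - 1{v_i <= m}| * (v_i - m)^2."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    m = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(max_iter):
        # Asymmetric weights: tau above the current estimate, 1 - tau below.
        weights = [tau if v > m else 1.0 - tau for v in values]
        new_m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_m - m) <= tol:
            return new_m
        m = new_m
    return m
```

With `tau = 0.5` the weights are symmetric and the result is the ordinary mean; larger values of `tau` move the estimate towards the upper tail of the sample.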
Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning [35]
We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of $\tilde{O}\left(\sqrt{d^{3}H^{3}K}\right)$, where $d$ is the dimension of the parameters, $H$ is the episode length, and $K$ is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by $\sqrt{H}$ and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to $H$ and $K$.
8.2 Bandits and RL face Real-life constraints
Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits [11]
We introduce GLR-klUCB, a novel algorithm for the piecewise i.i.d. non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free, change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge on the arms' means. We prove that this algorithm can attain a $O\left(\sqrt{TA\Upsilon_T\log(T)}\right)$ regret in $T$ rounds on some “easy” instances, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
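A Bernoulli Generalized Likelihood Ratio test of the kind used as a change-point detector can be sketched as follows. For every split point, the statistic compares the empirical means before and after the split with the overall mean through the Bernoulli KL divergence, and a change is declared when the largest statistic exceeds a threshold. The threshold `log(3 n^{3/2} / delta)` is a simplification assumed here and not necessarily the calibrated threshold analyzed in the paper; function names are hypothetical.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def glr_detects_change(rewards, delta=0.05):
    """Return True if some split of `rewards` (values in [0, 1]) looks like a change-point."""
    n = len(rewards)
    if n < 2:
        return False
    total = sum(rewards)
    mean_all = total / n
    # Simplified threshold (assumption of this sketch).
    threshold = math.log(3.0 * n ** 1.5 / delta)
    prefix = 0.0
    for s in range(1, n):
        prefix += rewards[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * bernoulli_kl(mean_left, mean_all)
                + (n - s) * bernoulli_kl(mean_right, mean_all))
        if stat > threshold:
            return True
    return False
```

In a bandit algorithm built on such a detector, the test would be run on the reward history of each arm, and a detection would trigger a restart of the bandit statistics.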
Near-Optimal Collaborative Learning in Bandits [27]
This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization [30]. In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.
Exploration in Reinforcement Learning: Beyond Finite State-Spaces [39]
Reinforcement learning (RL) is a powerful machine learning framework to design algorithms that learn to make decisions and to interact with the world. Algorithms for RL can be classified as offline or online. In the offline case, the algorithm is given a fixed dataset, based on which it needs to compute a good decision-making strategy. In the online case, an agent needs to efficiently collect data by itself, by interacting with the environment: that is the problem of exploration in reinforcement learning. This thesis presents theoretical and practical contributions to online RL. We investigate the worst-case performance of online RL algorithms in finite environments, that is, those that can be modeled with a finite number of states, and where the set of actions that can be taken by an agent is also finite. Such performance degrades as the number of states increases, whereas in real-world applications the state set can be arbitrarily large or continuous. To tackle this issue, we propose kernel-based algorithms for exploration that can be implemented for general state spaces, and for which we provide theoretical results under weak assumptions on the environment. Those algorithms rely on a kernel function that measures the similarity between different states, which can be defined on arbitrary state-spaces, including discrete sets and Euclidean spaces, for instance. Additionally, we show that our kernel-based algorithms are able to handle non-stationary environments by using time-dependent kernel functions, and we propose and analyze approximate versions of our methods to reduce their computational complexity. Finally, we introduce a scalable approximation of our kernel-based methods, that can be implemented with deep reinforcement learning and integrate different representation learning methods to define a kernel function.
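One way to picture how a kernel can drive exploration in a continuous state space is through kernel-smoothed visit counts: a state counts as "visited" to the extent that similar states were visited, and the exploration bonus shrinks as this generalized count grows. The sketch below is only an illustration under assumptions made here (Gaussian kernel, a bonus of the form scale / sqrt(1 + count), hypothetical function names); it is not the algorithm analyzed in the thesis.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Similarity in (0, 1] between two states given as tuples of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def generalized_count(query, visited, bandwidth=0.5):
    """Kernel-smoothed number of visits around `query`."""
    return sum(gaussian_kernel(query, state, bandwidth) for state in visited)


def exploration_bonus(query, visited, bandwidth=0.5, scale=1.0):
    """Bonus that decreases as the neighbourhood of `query` gets visited."""
    return scale / math.sqrt(1.0 + generalized_count(query, visited, bandwidth))
```

With an indicator kernel on a finite state set, the generalized count reduces to the usual visit count, which is why this construction can be read as an extension of count-based exploration.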
Risk-Sensitive Bayesian Games for Multi-Agent Reinforcement Learning under Policy Uncertainty [34]
In stochastic games with incomplete information, the uncertainty is evoked by the lack of knowledge about a player's own and the other players' types, i.e. the utility function and the policy space, and also the inherent stochasticity of different players' interactions. In existing literature, the risk in stochastic games has been studied in terms of the inherent uncertainty evoked by the variability of transitions and actions. In this work, we instead focus on the risk associated with the uncertainty over types. We contrast this with the multi-agent reinforcement learning framework where the other agents have fixed stationary policies and investigate risk-sensitiveness due to the uncertainty about the other agents' adaptive policies. We propose risk-sensitive versions of existing algorithms proposed for risk-neutral stochastic games, such as Iterated Best Response (IBR), Fictitious Play (FP) and a general multi-objective gradient approach using dual ascent (DAPG). Our experimental analysis shows that risk-sensitive DAPG performs better than competing algorithms for both social welfare and general-sum stochastic games.
SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning [20]
In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are twofold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better and, when used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
On Meritocracy in Optimal Set Selection [19]
Typically, merit is defined with respect to some intrinsic measure of worth. We instead consider a setting where an individual’s worth is relative: when a decision maker (DM) selects a set of individuals from a population to maximise expected utility, it is natural to consider the expected marginal contribution (EMC) of each person to the utility. We show that this notion satisfies an axiomatic definition of fairness for this setting. We also show that for certain policy structures, this notion of fairness is aligned with maximising expected utility, while for linear utility functions it is identical to the Shapley value. However, for certain natural policies, such as those that select individuals with a specific set of attributes (e.g. high enough test scores for college admissions), there is a trade-off between meritocracy and utility maximisation. We analyse the effect of constraints on the policy on both utility and fairness in extensive experiments based on college admissions and outcomes in Norwegian universities.
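The Shapley value mentioned above averages the marginal contribution of an individual over all orders in which the population can be added to a set. The sketch below computes it exactly by enumerating permutations, which is only feasible for a handful of individuals. It illustrates the Shapley value itself, not the paper's expected marginal contribution under a selection policy; the function name and the example utility are assumptions of this sketch.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for a set function `utility`."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: v / n_orders for p, v in values.items()}
```

For an additive (linear) utility, such as the sum of individual scores of the selected set, every marginal contribution of an individual equals their own score, so the Shapley value coincides with that score.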
How Biased is Your Feature?: Computing Fairness Influence Functions with Global Sensitivity Analysis, 47
Fairness in machine learning has attained significant focus due to the widespread application of machine learning in high-stakes decision-making tasks. Unless regulated with a fairness objective, machine learning classifiers might demonstrate unfairness/bias towards certain demographic populations in the data. Thus, the quantification and mitigation of the bias induced by classifiers have become a central concern. In this paper, we aim to quantify the influence of different features on the bias of a classifier. To this end, we propose a framework of Fairness Influence Function (FIF), and compute it as a scaled difference of conditional variances in the prediction of the classifier. We also instantiate an algorithm, FairXplainer, that uses variance decomposition among the subset of features and a local regressor to compute FIFs accurately, while also capturing the intersectional effects of the features. Our experimental analysis validates that FairXplainer captures the influences of both individual features and higher-order feature interactions, estimates the bias more accurately than existing local explanation methods, and detects the increase/decrease in bias due to affirmative/punitive actions in the classifier.
Algorithmic fairness verification with graphical models, 22
In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stakes decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions. To this end, several fairness verifiers have been proposed that compute the bias in the prediction of an ML classifier (essentially beyond a finite dataset) given the probability distribution of input features. In the context of verifying linear classifiers, existing fairness verifiers are limited in accuracy, due to imprecise modeling of correlations among features, and in scalability, due to restrictive formulations of the classifiers as SSAT/SMT formulas or to sampling. In this paper, we propose an efficient fairness verifier, called FVGM, that encodes the correlations among features as a Bayesian network. In contrast to existing verifiers, FVGM proposes a stochastic subset-sum based approach for verifying linear classifiers. Experimentally, we show that FVGM leads to an accurate and scalable assessment for more diverse families of fairness-enhancing algorithms, fairness attacks, and group/causal fairness metrics than the state-of-the-art. We also demonstrate that FVGM facilitates the computation of fairness influence functions as a stepping stone to detect the source of bias induced by subsets of features.
When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits, 16
We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP). First, we prove the minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with $\epsilon$-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large $\epsilon$), bandits with $\epsilon$-global DP are not harder than the bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal $\epsilon$-global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate $\epsilon$-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KL-UCB. AdaP-KL-UCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
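To make the first and third ingredients concrete, the following is a minimal sketch of a Laplace-noised empirical mean computed from the rewards of a single episode. It is not the AdaP-UCB or AdaP-KL-UCB implementation: the function name `private_episode_mean` is hypothetical, and rewards are assumed to lie in [0, 1] so that the mean of n rewards has sensitivity 1/n.

```python
import numpy as np

def private_episode_mean(rewards, epsilon, rng):
    """Laplace mechanism on the mean of one episode's rewards.

    Assumption (not from the report): every reward lies in [0, 1].
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    # Changing a single reward in [0, 1] moves the mean by at most 1/n.
    sensitivity = 1.0 / n
    # Laplace noise with scale sensitivity/epsilon gives epsilon-DP for this release.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return rewards.mean() + noise
```

The noise scale shrinks as the episode gets longer or as the privacy budget $\epsilon$ grows, which is consistent with the two hardness regimes described above.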
Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback, 32
In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through the feedback of an evaluation or simulation oracle. In real life, the feedback of such oracles is often noisy and available only after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedback can be multi-fidelity. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay- and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify the regret of PCTS under delayed, noisy, and multi-fidelity feedback. Specifically, we derive regret bounds of PCTS enabled with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon $T$, PCTS retains the regret bound of non-delayed HOO for expected delay of $O(\log T)$, and worsens by $T^{(1-\alpha)/(d+2)}$ for expected delays of $O(T^{1-\alpha})$ for $\alpha \in (0,1]$. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms the state-of-the-art black-box optimization methods for feedback with different noise levels, delays, and fidelity.
UDO: Universal Database Optimization using Reinforcement Learning, 33
UDO is a versatile tool for offline tuning of database systems for specific workloads. UDO can consider a variety of tuning choices, ranging from picking transaction code variants over index selections up to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations, creating and evaluating different configurations via actual query executions (instead of relying on simplifying cost models). To cater to different parameter types, UDO distinguishes heavy parameters (which are expensive to change, e.g. physical design parameters) from light parameters. Specifically for optimizing heavy parameters, UDO uses reinforcement learning algorithms that allow delaying the point at which the reward feedback becomes available. This gives us the freedom to optimize the point in time and the order in which different configurations are created and evaluated (by benchmarking a workload sample). UDO uses a cost-based planner to minimize reconfiguration overheads. For instance, it aims to amortize the creation of expensive data structures by consecutively evaluating configurations using them. We evaluate UDO on Postgres as well as MySQL and on TPC-H as well as TPC-C, optimizing a variety of light and heavy parameters concurrently.
Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm, 41
In this paper, we study the stochastic bandits problem with k unknown heavy-tailed and corrupted reward distributions or arms with time-invariant corruption distributions. At each iteration, the player chooses an arm. Given the arm, the environment returns an uncorrupted reward with probability $1-\epsilon$ and an arbitrarily corrupted reward with probability $\epsilon$. In our setting, the uncorrupted reward might be heavy-tailed and the corrupted reward might be unbounded. We prove a lower bound on the regret indicating that the corrupted and heavy-tailed bandits are strictly harder than uncorrupted or light-tailed bandits. We observe that the environments can be categorised into hardness regimes depending on the suboptimality gap $\Delta$, variance $\sigma$, and corruption proportion $\epsilon$. Following this, we design a UCB-type algorithm, namely HuberUCB, that leverages Huber's estimator for robust mean estimation. HuberUCB leads to tight upper bounds on regret in the proposed corrupted and heavy-tailed setting. To derive the upper bound, we prove a novel concentration inequality for Huber's estimator, which might be of independent interest.
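As an illustration of the robust mean estimation step, here is a minimal sketch of Huber's location estimator computed by iteratively reweighted averaging. It is not the HuberUCB index itself: the function name `huber_mean`, the default threshold `delta=1.345` (a conventional choice for unit-scale data), and the stopping rule are assumptions made for this example.

```python
import numpy as np

def huber_mean(x, delta=1.345, n_iter=100, tol=1e-10):
    """Huber's M-estimator of location via iteratively reweighted means.

    delta is the threshold beyond which residuals are down-weighted;
    its default value is an assumption, not a setting from the report.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        r = np.abs(x - mu)
        # Weight 1 for small residuals, delta/|r| for large ones.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        new_mu = np.sum(w * x) / np.sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return float(mu)
```

A single arbitrarily large corrupted reward shifts this estimate by a bounded amount, whereas it can move the empirical mean arbitrarily far.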
SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics, 21
Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. When deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
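For readers unfamiliar with the CVaR criterion mentioned above, the following minimal sketch computes an empirical Conditional Value-at-Risk of sampled returns. It only illustrates the risk measure, not SAAC; the function name `empirical_cvar` and the convention used (average of the worst `alpha` fraction of returns, lower being worse) are assumptions of this example.

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns (lower tail).

    Conventions for CVaR vary across papers; this lower-tail form is an
    assumption chosen for illustration.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    # Keep at least one sample so that the mean is well defined.
    k = max(1, int(np.ceil(alpha * returns.size)))
    return float(returns[:k].mean())
```

With `alpha = 1` the quantity reduces to the ordinary mean return, so smaller values of `alpha` correspond to more risk-averse evaluations.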
Online Instrumental Variable Regression: Regret Analysis and Bandit Feedback, 44
The independence of noise and covariates is a standard assumption in online linear regression and linear bandit literature. This assumption and the following analysis are invalid in the case of endogeneity, i.e., when the noise and covariates are correlated. In this paper, we study the online setting of instrumental variable (IV) regression, which is widely used in economics to tackle endogeneity. Specifically, we analyse and upper bound the regret of the Two-Stage Least Squares (2SLS) approach to IV regression in the online setting. Our analysis shows that Online 2SLS (O2SLS) achieves $O(d^{2}\log^{2} T)$ regret after $T$ interactions, where $d$ is the dimension of covariates. Following that, we leverage O2SLS as an oracle to design OFUL-IV, a linear bandit algorithm. OFUL-IV can tackle endogeneity and achieves $O(d\sqrt{T}\log T)$ regret. For datasets with endogeneity, we experimentally demonstrate that O2SLS and OFUL-IV incur lower regrets than the state-of-the-art algorithms for both the online linear regression and linear bandit settings.
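The effect of endogeneity, and how 2SLS counters it, can be seen on a small batch example. The sketch below is the classical offline 2SLS estimator, not the online O2SLS analysed in the paper; the function name and the synthetic data-generating process (an unobserved confounder `u` that enters both the covariate and the outcome) are assumptions made for illustration.

```python
import numpy as np

def two_stage_least_squares(Z, X, y):
    """Batch 2SLS. Z: instruments (n, p), X: covariates (n, d), y: outcomes (n,)."""
    # Stage 1: regress the covariates on the instruments.
    theta, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ theta
    # Stage 2: regress the outcome on the predicted covariates.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

# Hypothetical synthetic data: u makes the noise correlated with x.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=(n, 1))                       # instrument
u = rng.normal(size=n)                            # unobserved confounder
x = 2.0 * z[:, 0] + u + 0.1 * rng.normal(size=n)  # endogenous covariate
y = 3.0 * x + 2.0 * u + rng.normal(size=n)        # true coefficient is 3
X = x[:, None]

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # biased by the confounder
beta_iv = two_stage_least_squares(z, X, y)        # uses the instrument
```

On data generated this way, ordinary least squares overestimates the coefficient because `u` pushes `x` and `y` in the same direction, while the instrument-based estimate stays close to the true value.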
IMED-RL: Regret optimal learning of ergodic Markov decision processes, 26
We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed light-tailed, semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, ergodic and communicating only, showing the competitiveness of IMED-RL in finite time against state-of-the-art algorithms. IMED-RL also benefits from a light complexity.
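To give an idea of why IMED is computationally appealing, the sketch below implements the IMED arm-selection rule in the original multi-armed bandit setting with Bernoulli rewards: each arm receives the index $N_a \, \mathrm{KL}(\hat\mu_a, \hat\mu^\star) + \log N_a$, and the arm with the smallest index is pulled. This is the bandit rule only, not IMED-RL; the function names are hypothetical and every arm is assumed to have been pulled at least once.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choose_arm(counts, means):
    """Return the arm minimizing N_a * KL(mean_a, best_mean) + log(N_a).

    Assumes counts[a] >= 1 for every arm (an assumption of this sketch).
    """
    best = max(means)
    indices = [n * bernoulli_kl(m, best) + math.log(n)
               for n, m in zip(counts, means)]
    return min(range(len(indices)), key=indices.__getitem__)
```

The index needs no optimization step: an empirically best arm has index $\log N_a$, so a rarely pulled arm whose empirical mean is close to the best one is selected until its index catches up.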
8.3 Bandits and RL for real-life: Deep RL and Applications
Entropy Regularized Reinforcement Learning with Cascading Networks, 45
Deep Reinforcement Learning (Deep RL) has had incredible achievements on high dimensional problems, yet its learning process remains unstable even on the simplest tasks. Deep RL uses neural networks as function approximators. These neural models are largely inspired by developments in the (un)supervised machine learning community. Compared to these learning frameworks, one of the major difficulties of RL is the absence of i.i.d. data. One way to cope with this difficulty is to control the rate of change of the policy at every iteration. In this work, we challenge the common practice of the (un)supervised learning community of using a fixed neural architecture, by having a neural model that grows in size at each policy update. This allows a closed-form entropy regularized policy update, which leads to a better control of the rate of change of the policy at each iteration and helps cope with the non-i.i.d. nature of RL. Initial experiments on classical RL benchmarks show promising results with remarkable convergence on some RL tasks when compared to other deep RL baselines, while exhibiting limitations on others.
Automated planning for robotic guidewire navigation in the coronary arteries, 29
Soft continuum robots and comparable instruments make it possible to perform some surgical procedures non-invasively. While safer, less morbid and more cost-effective, these medical interventions are more complex for practitioners: the manipulation of anatomical structures is indirect, through telescopic and flexible devices, and the visual feedback is indirect, through monitors. Interventional cardiology is an example of complex procedures where catheters and guidewires are manipulated to reach and treat remote areas of the vascular network. Such interventions may be assisted by a robot that operates the tools, but the planning (choice of tools and trajectories) remains a complex task. In this paper, we use a simulation framework for flexible devices inside the vasculature and we propose a method to automatically control these devices to reach specific locations. Experiments performed on 15 patient geometries exhibit good performance. Automatic manipulation reaches the goal in more than 90% of the cases.
SofaGym: An open platform for Reinforcement Learning based on Soft Robot simulations, 15
OpenAI Gym is one of the standard interfaces used to train Reinforcement Learning (RL) algorithms. The Simulation Open Framework Architecture (SOFA) is a physics-based engine that is used for soft robotics simulation and control based on real-time models of deformation. The aim of this paper is to present SofaGym, an open source software to create OpenAI Gym interfaces, called environments, out of soft robot digital twins. The link between soft robotics and RL offers new challenges for both fields: representation of the soft robot in an RL context, complex interactions with the environment, use of specific mechanical tools to control soft robots, transfer of policies learned in simulation to the real world, etc. The article presents the wide range of possible uses of SofaGym to tackle these challenges by using RL and planning algorithms. This publication contains neither new algorithms nor new models but proposes a new platform, open to the community, that offers previously unavailable possibilities of coupling RL to physics-based simulation of soft robots. We present 11 environments, representing a wide variety of soft robots and applications, and we highlight the challenges showcased by each environment. We propose methods of solving the task using traditional control, RL and planning, and point out research perspectives using the platform.
Reinforcement Learning for crop management, 13
Reinforcement Learning (RL), including Multi-Armed Bandits, is a branch of Machine Learning that deals with the problem of sequential decision-making in uncertain and unknown environments through learning by practice. While best known for being the core of the Artificial Intelligence (AI) world’s best Go game player, RL has a vast potential range of applications. RL may help to address some of the criticisms leveled against crop management Decision Support Systems (DSS): it is an interactive, action-oriented, contextual tool to evaluate series of crop operations faced with uncertainties. A review of RL use for crop management DSS reveals a limited number of contributions. We profile key prospects for a human-centered, real-world, interactive RL-based system to face tomorrow’s agricultural decisions, as well as the theoretical and ongoing practical challenges that may explain its current low take-up. We argue that a joint research effort from the RL and agronomy communities is necessary to explore RL’s full potential.
gym-DSSAT: a crop model turned into a Reinforcement Learning environment, 46
Addressing a real-world sequential decision problem with Reinforcement Learning (RL) usually starts with the use of a simulated environment that mimics real conditions. We present a novel open source RL environment for realistic crop management tasks. gym-DSSAT is a gym interface to the Decision Support System for Agrotechnology Transfer (DSSAT), a high-fidelity crop simulator. DSSAT has been developed over the last 30 years and is widely recognized by agronomists. gym-DSSAT comes with predefined simulations based on real-world maize experiments. The environment is as easy to use as any gym environment. We provide performance baselines using basic RL algorithms. We also briefly outline how the monolithic DSSAT simulator written in Fortran has been turned into a Python RL environment. Our methodology is generic and may be applied to similar simulators. We report on very preliminary experimental results which suggest that RL can help researchers to improve the sustainability of fertilization and irrigation practices.
Foundations and state of the art, 38
In this chapter, we address the foundational aspects of digital technology, their use in agriculture and current research.
Combination of gene regulatory networks and sequential machine learning for drug repurposing, 40
Given the ever-increasing cost of designing de novo molecules to target causes of diseases, and the huge amount of currently available biological data, the development of systematic explorative pipelines for drug development has become of paramount importance. In my thesis, I focused on drug repurposing, which is a paradigm that aims at identifying new therapeutic indications for known chemical compounds. Due to the already large collection of publicly available transcriptomic data (that is, data related to protein production through the transcription of gene DNA sequences), I investigated how to process this information about gene activity in a transparent and controllable way to screen molecules. The current state of research in drug development indicates that such generic approaches might considerably speed up the discovery of promising therapies, especially for neglected or rare diseases. First, noting that transcriptomic measurements are the product of a complex dynamical system of co- and inter-gene activity regulations, I worked on integrating diverse types of biological information in an automated fashion in order to build a model of these regulations. That is where gene regulatory networks, and more specifically Boolean networks, intervene. Such models are useful both for explaining observed transcription levels and for predicting the result of gene activity perturbations through molecules. Second, these models allow online in silico drug testing. While using the predictive features of Boolean networks can be costly, the core assumption of this thesis is that combining them with sequential learning algorithms, such as multi-armed bandits, might mitigate that cost and help control the error rate in recommended therapeutic candidates. This is the drug testing procedure suggested throughout my PhD.
The question of the proper integration of known side information about the chemical compounds into multi-armed bandits is crucial, and has also been investigated further. Finally, I applied part of my work to ranking different treatment protocols for neurorepair in the case of encephalopathy in premature infants. On the theoretical side, I also contributed to the design of an algorithm which is able to extend the drug testing procedure in a distributed way, for instance across several tested populations, disease models, or research teams.
An Integer Linear Programming Approach for Pipelined Model Parallelism, 42
The training phase in Deep Neural Networks has become an important source of computing resource usage and, because of the resulting volume of computation, it is crucial to perform it efficiently on parallel architectures. Even today, data parallelism is the most widely used method, but the associated requirement to replicate all the weights on all computation resources poses memory problems at the level of each node and collective communication problems at the level of the platform. In this context, model parallelism, which consists in distributing the different layers of the network over the computing nodes, is an attractive alternative. Indeed, it is expected to better distribute weights (to cope with memory problems) and it does not imply large collective communications since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined/streaming approach, which leads in turn to new memory costs. The goal of this paper is to model these memory costs in detail and to show that it is possible to formalize this optimization problem as an Integer Linear Program (ILP).
MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism, 18
The training phase in Deep Neural Networks (DNNs) is very computationally intensive and is nowadays often performed on parallel computing platforms, ranging from a few GPUs to several thousand GPUs. The strategy of choice for the parallelization of training is the so-called data parallel approach, based on the parallel training of the different inputs (typically images) and the aggregation of network weights with collective communications (AllReduce). The scalability of this approach is limited both by the memory available on each node and by the networking capacities for collective operations. Recently, a parallel model approach, in which the network weights are distributed and images are trained in a pipeline/stream manner over the computational nodes, has been proposed (Pipedream, Gpipe). In this paper, we formalize in detail the optimization problem associated with the placement of DNN layers onto computation resources when using pipelined model parallelism, and we derive a dynamic programming based heuristic, MadPipe, that significantly improves the performance of the parallel model approach compared to the literature.
Weight Offloading Strategies for Training Large DNN Models, 43
The limited memory of GPUs induces serious problems in the training phase of deep neural networks (DNNs). Indeed, with the recent tremendous increase in the size of DNN models, which can now routinely include hundreds of billions or even trillions of parameters, it is impossible to store these models in the memory of a GPU, and several strategies have been devised to solve this problem. In this paper, we analyze in detail the strategy that consists in offloading the weights of some model layers from the GPU to the CPU when they are not used. Since the PCI bus bandwidth between the GPU and the CPU is limited, it is crucial to know which layers should be transferred (offloaded and prefetched) and when. We prove that this problem is in general NP-complete in the strong sense and we propose a lower bound formulation in the form of an Integer Linear Program (ILP). We propose heuristics to select the layers to offload and to build the schedule of data transfers. We show that this approach makes it possible to build near-optimal weight offloading strategies on realistic size DNNs and architectures.
8.4 Other
Topics in robust statistical learning, 12
Some recent contributions to robust inference are presented. Firstly, the classical problem of robust M-estimation of a location parameter is revisited using an optimal transport approach, with specifically designed Wasserstein-type distances, that reduces robustness to a continuity property. Secondly, a procedure of estimation of the distance function to a compact set is described, using unions of balls. This methodology originates in the field of topological inference and offers as a by-product a robust clustering method. Thirdly, a robust Lloyd-type algorithm for clustering is constructed, using a bootstrap variant of the median-of-means strategy. This algorithm comes with a robust initialization.
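The median-of-means strategy underlying the third contribution can be sketched in a few lines. The example below is the plain median-of-means estimator of a mean, not the bootstrap variant nor the Lloyd-type clustering algorithm described above; the function name `median_of_means` and the optional shuffling are assumptions of this illustration.

```python
import numpy as np

def median_of_means(x, n_blocks, rng=None):
    """Split x into n_blocks blocks, average each block, return the median.

    If rng is given, the sample is shuffled first so that blocks do not
    depend on the original ordering (an optional choice of this sketch).
    """
    x = np.asarray(x, dtype=float)
    if rng is not None:
        x = rng.permutation(x)
    blocks = np.array_split(x, n_blocks)
    return float(np.median([b.mean() for b in blocks]))
```

As long as outliers contaminate fewer than half of the blocks, the median ignores the corrupted block means, which is what makes the strategy attractive for robust clustering.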
Concentration study of M-estimators using the influence function, 14
We present a new finite-sample analysis of M-estimators of location in a Hilbert space using the tool of the influence function. In particular, we show that the deviations of an M-estimator can be controlled thanks to its influence function (or its score function), and we then use concentration inequalities on M-estimators to investigate the robust estimation of the mean in high dimension in a corrupted setting (adversarial corruption setting) for bounded and unbounded score functions. For a sample of size $n$ and covariance matrix $\Sigma$, we attain the minimax speed $\sqrt{\mathrm{Tr}(\Sigma)/n}+\sqrt{\|\Sigma\|_{op}\log(1/\delta)/n}$ with probability larger than $1-\delta$ in a heavy-tailed setting. One of the major advantages of our approach compared to others recently proposed is that our estimator is tractable and fast to compute even in very high dimension, with a complexity of $O(nd\log(\mathrm{Tr}(\Sigma)))$, where $n$ is the sample size and $\Sigma$ is the covariance matrix of the inliers; the code that we make available for this article has been tested to be very fast.
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
Participants: Philippe Preux, Léonard Hussenot, Johan Ferret, Jean Tarbouriech.
 2 contracts with Google regarding PhDs of J. Ferret and L. Hussenot (2020–2022), managed by Ph. Preux.
 1 contract with Facebook AI Research regarding PhD of J. Tarbouriech (2020–2022), managed by Ph. Preux.
10 Partnerships and cooperations
Participants: all Scool permanent members.
10.1 International initiatives
10.1.1 Inria associate team not involved in an IIL or an international program
DC4SCM

Title:
Data Collection for Smart Crop Management

Duration:
2020 $\to $ 2024

Coordinator:
Philippe Preux

Partners:
 Bihar Agriculture University, India,
 Inria FUN, Lille.

Inria contact:
Philippe Preux

Summary:
As part of our research activities related to the application of reinforcement learning and bandits to agriculture, this associate team aims to provide us with in-field data, as well as the ability to perform in-field experiments. Such experiments are extremely useful to train our algorithms, which have to explore, that is, test new actions in the field and observe their outcome. This approach is complementary to the one we investigate with the use of the DSSAT simulator.
RELIANT

Title:
Real-life bandits

Duration:
2022 $\to $ 2024

Coordinator:
Junya Honda (honda@i.kyoto-u.ac.jp)

Partners:
 Kyoto University, Kyoto (Japan)

Inria contact:
Odalric-Ambrym Maillard

Summary:
The RELIANT project studies the real-world applicability of sequential decision making from a reinforcement learning (RL) and multi-armed bandit (MAB) theory standpoint. Building on over a decade of leading expertise in advancing the field of MAB and RL theory, our two teams have also developed interactions with practitioners (e.g. in healthcare, personalized medicine or agriculture) in recent projects, in the quest to bring modern bandit theory to actual societal applications. This quest for real-world reinforcement learning, rather than working in simulated and toy environments, is today’s main grand challenge of the field, and the one that hinders applications to society and industry. While MABs are acknowledged to be the most applicable building block of RL, as experts interacting with practitioners from different fields we have identified a number of key bottlenecks on which joining our efforts is expected to significantly impact the applicability of MAB to the real world. These are related to the typically small sample sizes that arise in medical applications, the complicated types of reward distributions that arise, e.g. in agriculture, the numerous constraints (such as fairness) that should be taken into account to speed up learning and make ethical decisions, and the possibly non-stationary aspects of the tasks. We propose to connect, at the mathematical level, our complementary expertise on multi-armed bandits (MAB), sequential hypothesis testing (SHT) and Markov decision processes (MDP) to address these challenges and significantly advance the design of the next generation of sequential decision making algorithms for real-life applications.
10.1.2 STIC AmSud projects

Title:
emistral

Duration:
2021 $\to $ 2022

Coordinator:
Luis Marti (Inria Chile)

Partners:
 Inria Chile
 UFF, Niteroi, Brazil
 Universidad de la República, Montevideo, Uruguay
 Inria Scool, Inria AIO

Inria contact:
Philippe Preux

Summary:
The current climate crisis calls for the use of all available technology to try to understand, model, predict and hopefully work towards its mitigation. Oceans and rivers play a key role in regulating the planet’s climate, weather and ecology, and therefore in grasping the complex and intertwined processes that govern these phenomena. Recent advances in computer science and applied mathematics, such as machine learning, artificial intelligence and scientific computation, among others, have produced a revolution in our capacity for understanding the emergence of patterns and dynamics in complex systems, while at the same time the complexity of these problems poses significant challenges to computer science itself. The key factor deciding the success or failure of the application of these methods is having sufficient and adequate data. Oceanographic vessels have been extensively used to gather this data. However, they have been shown to be insufficient because of their high operating cost, the risks involved and their limited availability. Autonomous sailboats present themselves as a viable alternative. In principle, by relying on wind energy they could operate for indefinite periods, being limited only by the effects of fouling and the wear and tear of materials. Recent results in the area of machine learning are especially suited to fill this gap, in particular reinforcement learning (RL), transfer learning (TL) and autonomous learning (AL). The combination of those methods could overcome the need to program a particular controller for every boat, as it would be capable of replicating, to some degree, the learning process of human skippers and sailors.
10.2 International research visitors
10.2.1 Visits of international scientists
Other international visits to the team
 Agrawal Shubhada, postdoc at Georgia Tech, April–June 2022.
10.3 European initiatives
10.3.1 Other European programs/initiatives

Title:
CausalXRL

Duration:
2021 $\to $ 2024

Coordinator:
Aditya Gilra, U. Amsterdam

Partners:
 U. Amsterdam
 U. Sheffield
 U. Vienna
 Inria Scool

Inria contact:
Philippe Preux

Summary:
Deep reinforcement learning systems are approaching or surpassing human-level performance in specific domains, from games to decision support to continuous control, albeit in non-critical environments. Most of these systems require random exploration and state-action-value-based exploitation of the environment. However, in important real-life domains, like medical decision support or patient rehabilitation, every decision or action must be fully justified and certainly not random. We propose to develop neural networks that learn causal models of the environment relating action to effect, initially using offline data. The models will then be interfaced with reinforcement learning and decision support networks, so that every action taken online can be explained or justified based on its expected effect. The causal model can then be refined iteratively, making it possible to better predict the future cascading effects of any action chain. The system, subsequently termed CausalXRL, will only propose actions that can be justified on the basis of beneficial effects. When the immediate benefit is uncertain, the system will propose explorative actions that generate the most probable future benefit. CausalXRL thus supports the user in choosing actions based on specific expected outcomes, rather than as prescribed by a black box.
10.4 National initiatives
Scool is involved in 1 ANR project:
 ANR Bold, headed by V. Perchet (ENS Paris-Saclay, ENSAE), local head: É. Kaufmann, 2019–2023.
Scool is involved in some Inria projects:

Challenge HPC – Big Data, headed by B. Raffin, Datamove, Grenoble.
In this challenge, we collaborate with:
 B. Raffin, on what HPC can bring to reinforcement learning and how it can best be used for it.
 O. Beaumont, E. Jeannot, on what RL can bring to HPC, in particular the use of RL for task scheduling.

In this challenge, we collaborate with L. Gallaraga, CR Inria Rennes, about the combination of statistical and symbolic approaches in machine learning.
 Exploratory action “Sequential Recommendation for Sustainable Gardening (SR4SG)”, headed by OA. Maillard.
Other collaborations in France:
 R. Gautron, PhD student, Cirad, agricultural practices recommendation.
 L. Soulier, Associate Professor, Sorbonne Université, reinforcement learning for information retrieval.
 M. Valko, researcher DeepMind.
 A. Delahaye-Duriez, INSERM, Université de Paris.
 B. de Saporta, Université de Montpellier, piecewise-deterministic Markov processes.
 A. Garivier, Professor, ENS Lyon
 V. Perchet, Professor, ENSAE & Criteo AI Lab
 P. Gaillard, CR, Inria Grenoble - Rhône-Alpes
 A. Bellet, CR, Inria Lille - Nord Europe (Équipe Magnet)
10.5 Regional initiatives

OA. Maillard and Ph. Preux are supported by an AI chair. 3/5 of this chair is funded by the Métropole Européenne de Lille, and the other 2/5 by the Université de Lille and Inria, through the AI Ph.D. ANR program (2020–2023).
This chair is dedicated to the advancement of research on reinforcement learning.
11 Dissemination
Participants: many Scool members.
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
General chair, scientific chair
 R. Degenne co-organized the Complex Feedback in Online Learning workshop at ICML 2022.
11.1.2 Scientific events: selection
Member of the conference program committees
 E. Kaufmann is a PC member for ALT and EWRL.
 OA. Maillard is a PC member for ICML, AISTATS.
 Ph. Preux is an SPC member for AAAI, and a PC member for IJCAI and ECML.
Reviewer
 R. Degenne: ICML and AISTATS
11.1.3 Journal
Member of the editorial boards
 OA. Maillard is on the editorial board of JMLR.
Reviewer - reviewing activities
 R. Akrour: JMLR
 R. Degenne: JMLR
 E. Kaufmann: IEEE IT, IEEE Transactions on Games, Statistica Sinica
 OA. Maillard: Journal of Machine Learning Research (JMLR), Autonomous Agents and Multi-Agent Systems (AGNT), Machine Learning (MACH), The Annals of Statistics (AoS)
11.1.4 Invited talks
 R. Degenne: invited talk at the Probability & Statistics Group Heidelberg-Mannheim
 E. Kaufmann: plenary talk at the Journées de Statistiques, invited speaker at SNB (Statistics and Biopharmacy) 2022, invited talk at the Harvard Statistics Colloquium (virtual)
11.1.5 Scientific expertise
 Ph. Preux is:
 a member of the IRD CSS 5 (data science and models),
 a member of the Commission d'Évaluation (CE) of Inria,
 a member of the scientific committee on ethics of the Health Data warehouse of the CHU de Lille.
 OA. Maillard is:
 a member of the Commission Emploi Recherche (CER) of Inria Lille.
11.1.6 Research administration
 Ph. Preux was deputy scientific delegate at Inria Lille until June 2022.
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
 R. Akrour: Apprentissage à partir de données humaines, M1 in Cognitive Science, Université de Lille
 R. Akrour: Perception et motricité 2, L2 MIASHS, Université de Lille
 R. Akrour: Perception et motricité 1, L1 MIASHS, Université de Lille
 R. Degenne: Sequential learning, M2 MVA, ENS Paris-Saclay
 R. Degenne: Sequential learning, Centrale Lille
 R. Degenne: Sciences des données 3, L3 MIASHS, Université de Lille
 E. Kaufmann: Sequential Decision Making (24h), M2 Data Science, Ecole Centrale Lille.
 OA. Maillard: Statistical Reinforcement Learning (48h), MAP/INF641, Master Artificial Intelligence and advanced Visual Computing, École Polytechnique.
 OA. Maillard: Reinforcement Learning (24h), Master 2 Artificial Intelligence, École CentraleSupélec.
 Ph. Preux: « IA et apprentissage automatique », DU IA & Santé, Université de Lille.
 Ph. Preux: « prise de décision séquentielle dans l'incertain », M2 in Computer Science, Université de Lille.
 Ph. Preux: « apprentissage par renforcement », M2 in Computer Science, Université de Lille.
11.2.2 Supervision
 R. Akrour and Ph. Preux supervised the internship of:
 Hector Kohler, M2 computer science, Sorbonne Université, Paris,
 R. Akrour and D. Basu supervised the internship of:
 Mahdi Kallel, M2 Optimization, Institut Polytechnique de Paris, Paris,
 É. Kaufmann supervised the internship of:
 Cyrille Kone, MVA.
11.2.3 Juries
 E. Kaufmann was a member of the juries of:
 Ph.D. in CS of Léonard Blier, Université Paris-Saclay
 Ph.D. in maths of Solenne Gaucher, Université Paris-Saclay
 Ph.D. in CS of Geovani Rizk, Université Paris Dauphine
 Ph.D. in CS of Arnaud Delaruyelle, Université de Lille
 Ph.D. in maths of El Mehdi Saad, Université Paris-Saclay
 Ph.D. in CS of Sarah Perrin, Université de Lille
 OA. Maillard was a member of the juries of:
 Ph.D. in Agronomy of Romain Gautron, Université de Montpellier
 Ph. Preux was a member of the juries of:
 Ph.D. in CS of Camille-Sovanneary Gauthier, Université de Rennes
 Ph.D. in CS of Pierre Schegg, Université de Lille
 Ph.D. in CS of Johan Ferret, Université de Lille
 Ph.D. in CS of Jean Tarbouriech, Université de Lille
 Ph.D. in CS of Léonard Hussenot, Université de Lille
 Ph.D. in Agronomy of Romain Gautron, Université de Montpellier
 Ph.D. in CS of David Saltiel, Université du Littoral Côte d'Opale
 Ph.D. in CS of Sigfried Delannoy, Université du Littoral Côte d'Opale
11.3 Popularization
11.3.1 Articles and contents
 an article on the Inria website regarding the Bandits for Health project.
11.3.2 Interventions
 T. Mathieu gave a talk on « Les statistiques ne servent pas qu’à nous espionner » (statistics are not just spying on us), Université D'Anchin, Douai, Oct. 2022.
11.3.3 Other mediation actions
 Ph. Preux is involved in the Merlin project at Université de Lille on “The big investigation on AI”. The outcome of this project is a TV program produced by « L'esprit Sorcier TV », broadcast in February 2023 and then made available on replay.
 Ph. Preux is part of the scientific committee of the « Forum des Sciences » in Villeneuve d'Ascq regarding the season on AI.
12 Scientific production
12.1 Major publications
 1 inproceedings: Spectral Learning from a Single Trajectory under Finite-State Policies. International Conference on Machine Learning, Proceedings of the International Conference on Machine Learning, Sydney, Australia, July 2017.
 2 inproceedings: Multi-Player Bandits Revisited. Algorithmic Learning Theory, Mehryar Mohri and Karthik Sridharan (eds.), Lanzarote, Spain, April 2018.
 3 article: Sequential approaches for learning datum-wise sparse representations. Machine Learning, 89(1-2), October 2012, 87-122.
 4 inproceedings: Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL. IJCAI 2020 - International Joint Conference on Artificial Intelligence, Yokohama, Japan, July 2020.
 5 inproceedings: Optimal Best Arm Identification with Fixed Confidence. 29th Annual Conference on Learning Theory (COLT), 49, JMLR Workshop and Conference Proceedings, New York, United States, June 2016.
 6 article: Operator-valued Kernels for Learning from Functional Response Data. Journal of Machine Learning Research, 17(20), 2016, 1-54.
 7 inproceedings: Monte-Carlo Tree Search by Best Arm Identification. NIPS 2017 - 31st Annual Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, Long Beach, United States, December 2017, 1-23.
 8 article: Boundary Crossing Probabilities for General Exponential Families. Mathematical Methods of Statistics, 27, 2018.
 9 inproceedings: Tightening Exploration in Upper Confidence Reinforcement Learning. International Conference on Machine Learning, Vienna, Austria, July 2020.
 10 inproceedings: Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. International Conference on Machine Learning, 32, Journal of Machine Learning Research, Workshop and Conference Proceedings; Proceedings of The 31st International Conference on Machine Learning, Beijing, China, June 2014.
12.2 Publications of the year
International journals
 11 article: Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits. Journal of Machine Learning Research, March 2022.
 12 article: Topics in robust statistical learning. ESAIM: Proceedings and Surveys, 2022.
 13 article: Reinforcement Learning for crop management. Computers and Electronics in Agriculture, 200, July 2022, 107182.
 14 article: Concentration study of M-estimators using the influence function. Electronic Journal of Statistics, 16(1), January 2022, 3695-3750.
 15 article: SofaGym: An open platform for Reinforcement Learning based on Soft Robot simulations. Soft Robotics, 2022.
International peerreviewed conferences
 16 inproceedings: When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits. Advances in Neural Information Processing Systems, New Orleans, United States, December 2022.
 17 inproceedings: Efficient Algorithms for Extreme Bandits. International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research (PMLR), Virtual Conference, Spain, March 2022.
 18 inproceedings: MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism. ScaDL 2022 - Scalable Deep Learning over Parallel and Distributed Infrastructure - An IPDPS 2022 Workshop, Proceedings of IPDPS W'22, Lyon / Virtual, France, 2022.
 19 inproceedings: On Meritocracy in Optimal Set Selection. EAAMO 2022 - Equity and Access in Algorithms, Mechanisms, and Optimization, Arlington, United States, October 2022.
 20 inproceedings: SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning. UAI 2022 - Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, 180, Proceedings of Machine Learning Research, Eindhoven, Netherlands, August 2022, 631-640.
 21 inproceedings: SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics. RLDM 2022 - The Multidisciplinary Conference on Reinforcement Learning and Decision Making, Providence, United States, June 2022.
 22 inproceedings: Algorithmic fairness verification with graphical models. AAAI-2022 - 36th AAAI Conference on Artificial Intelligence, 2, Virtual, United States, 2022.
 23 inproceedings: Top Two Algorithms Revisited. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, New Orleans, United States, November 2022.
 24 inproceedings: Choosing Answers in epsilon-Best-Answer Identification for Linear Bandits. 39th International Conference on Machine Learning (ICML 2022), Baltimore, United States, July 2022.
 25 inproceedings: Meta-learning from Learning Curves: Challenge Design and Baseline Results. IJCNN 2022 - International Joint Conference on Neural Networks, Padua, Italy, IEEE, July 2022, 1-8.
 26 inproceedings: IMED-RL: Regret optimal learning of ergodic Markov decision processes. NeurIPS 2022 - Thirty-sixth Conference on Neural Information Processing Systems, New Orleans, United States, November 2022.
 27 inproceedings: Near-Optimal Collaborative Learning in Bandits. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, New Orleans, United States, December 2022.
 28 inproceedings: Offline Reinforcement Learning as Anti-Exploration. AAAI 2022 - 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada, February 2022.
 29 inproceedings: Automated planning for robotic guidewire navigation in the coronary arteries. Robosoft 2022 - International Conference on Soft Robotics, Edinburgh, United Kingdom, April 2022.
 30 inproceedings: Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, Advances in Neural Information Processing Systems, New Orleans, United States, November 2022.
 31 inproceedings: On Elimination Strategies for Bandit Fixed-Confidence Identification. NeurIPS 2022 - 36th Conference on Neural Information Processing Systems, New Orleans, United States, November 2022.
 32 inproceedings: Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback. AAAI Conference on Artificial Intelligence, 36, Proceedings of the AAAI Conference on Artificial Intelligence, 9, Virtual, United States, June 2022, 10381-10390.
 33 inproceedings: UDO: Universal Database Optimization using Reinforcement Learning. Proceedings of the VLDB Endowment, 14, Proceedings of the VLDB Endowment, 13, Sydney, Australia, VLDB Endowment, September 2021, 3402-3414.
Conferences without proceedings
 34 inproceedings: Risk-Sensitive Bayesian Games for Multi-Agent Reinforcement Learning under Policy Uncertainty. OptLearnMAS@AAMAS - Workshop on Optimization and Learning in Multiagent Systems at International Conference on Autonomous Agents and Multiagent Systems, Virtual, New Zealand, May 2022.
 35 inproceedings: Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning. EWRL 2022 – European Workshop on Reinforcement Learning, Milan, Italy, September 2022.
 36 inproceedings: Risk-aware linear bandits with convex loss. European Workshop on Reinforcement Learning, Milan, Italy, September 2022.
 37 inproceedings: Optimistic PAC Reinforcement Learning: the Instance-Dependent View. EWRL 2022 - European Workshop on Reinforcement Learning, Milan, Italy, September 2022.
Scientific book chapters
 38 inbook: Foundations and state of play. Agriculture and Digital Technology: Getting the most out of digital technology to contribute to the transition to sustainable agriculture and food systems, White book Inria, 6, INRIA, 2022, 30-75.
Doctoral dissertations and habilitation theses
 39 thesis: Exploration in Reinforcement Learning: Beyond Finite State-Spaces. Université de Lille, March 2022.
 40 thesis: Combination of gene regulatory networks and sequential machine learning for drug repurposing. Université Paris Cité, September 2022.
Reports & preprints
 41 misc: Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm. March 2022.
 42 report: An Integer Linear Programming Approach for Pipelined Model Parallelism. RR-9452, Inria, January 2022.
 43 misc: Weight Offloading Strategies for Training Large DNN Models. February 2022.
 44 misc: Online Instrumental Variable Regression: Regret Analysis and Bandit Feedback. October 2022.
 45 report: Entropy Regularized Reinforcement Learning with Cascading Networks. 7003, Inria Lille Nord Europe - Laboratoire CRIStAL - Université de Lille, September 2022, 16.
 46 report: gym-DSSAT: a crop model turned into a Reinforcement Learning environment. RR-9460, Inria Lille, July 2022, 31.
 47 misc: How Biased is Your Feature?: Computing Fairness Influence Functions with Global Sensitivity Analysis. September 2022.
 48 misc: Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. October 2022.
 49 misc: Meta-learning from Learning Curves Challenge: Lessons learned from the First Round and Design of the Second Round. August 2022.
12.3 Cited publications
 50 book: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
 51 unpublished: A Tour of Reinforcement Learning: The View from Continuous Control. 2018, arXiv preprint 1806.09460.
 52 book: Reinforcement Learning: an Introduction. http://incompleteideas.net/book/the-book-2nd.html, MIT Press, 2018.
 53 book: Bandit Algorithms. Cambridge University Press, 2019.