SCOOL

SCOOL - 2025

2025Activity report‌Project-TeamSCOOL

RNSR: 202023603Y‌

Research center Inria Centre‌‌ at the University of Lille
In partnership with:‌CNRS, Université de Lille‌
Team name: Sequential decision‌‌ making under uncertainty problem
In collaboration with:Centre‌ de Recherche en Informatique,‌ Signal et Automatique de‌‌ Lille

Creation of the Project-Team: 2020 November 01‌

Each year, Inria research‌ teams publish an Activity‌‌ Report presenting their work and results over the‌ reporting period. These reports‌ follow a common structure,‌‌ with some optional sections depending on the specific‌ team. They typically begin‌ by outlining the overall‌‌ objectives and research programme, including the main research‌ themes, goals, and methodological‌ approaches. They also describe‌‌ the application domains targeted by the team, highlighting‌ the scientific or societal‌ contexts in which their‌‌ work is situated.

The reports then present the‌ highlights of the year,‌ covering major scientific achievements,‌‌ software developments, or teaching contributions. When relevant, they‌ include sections on software,‌ platforms, and open data,‌‌ detailing the tools developed and how they are‌ shared. A substantial part‌ is dedicated to new‌‌ results, where scientific contributions are described in detail,‌ often with subsections specifying‌ participants and associated keywords.‌‌

Finally, the Activity Report addresses funding, contracts, partnerships,‌ and collaborations at various‌ levels, from industrial agreements‌‌ to international cooperations. It also covers dissemination and‌ teaching activities, such as‌ participation in scientific events,‌‌ outreach, and supervision. The document concludes with a‌ presentation of scientific production,‌ including major publications and‌‌ those produced during the‌ year.

Keywords

Computer Science and Digital Science

A3.‌ Data and knowledge
A3.1. Data
A3.1.1. Modeling, representation‌
A3.1.4. Uncertain data
A3.1.11. Structured data
A3.3. Data‌ and knowledge analysis
A3.3.1. On-line analytical processing
A3.3.2.‌ Data mining
A3.3.3. Big data analysis
A3.4. Machine‌ learning and statistics
A3.5.2. Recommendation systems
A5.1. Human-Computer‌ Interaction
A8.6. Information theory
A8.11. Game Theory
A9.‌ Artificial intelligence
A9.2. Machine learning
A9.2.1. Supervised learning‌
A9.2.2. Unsupervised learning
A9.2.3. Reinforcement learning
A9.2.4. Optimization‌ and learning
A9.2.5. Bayesian methods
A9.2.6. Neural networks‌
A9.2.8. Deep learning
A9.3. Signal processing
A9.4. Natural‌ language processing
A9.7. AI algorithmics

1 Team members, visitors, external collaborators

Research Scientists‌

Riadh Akrour [INRIA, ISFP, HDR‌]
Debabrota Basu [INRIA, ISFP]‌
Remy Degenne [INRIA, ISFP]
Emilie‌ Kaufmann [CNRS, Researcher, HDR]‌
Odalric-Ambrym Maillard [INRIA, Researcher, HDR‌]
Timothee Mathieu [INRIA, Researcher]‌

Faculty Members

Philippe Preux [Team leader,‌ UNIV LILLE, Professor, HDR]
Juliette‌ Achddou [UNIV LILLE, Associate Professor]‌

Post-Doctoral Fellows

Sabrine Chebbi [INRIA, Post-Doctoral‌ Fellow]
Lorenzo Hermez [INRIA, Post-Doctoral‌ Fellow, from Nov 2025]
Tanguy Lefort‌ [INRIA, Post-Doctoral Fellow, until Feb‌ 2025]

PhD Students

Ayoub Ajarra [INRIA‌]
Mickael Basson [LILLY FRANCE, CIFRE‌]
Yann Berthelot [SAINT GOBAIN RESEARCH,‌ CIFRE]
Hadrien Crassous [INRIA, from‌ Nov 2025]
Udvas Das [INRIA]‌
Brahim Driss [INRIA]
Anthony Kobanda [‌UBISOFT, CIFRE]
Hector Kohler [UNIV‌ LILLE, until Sep 2025]
Penanklihi Cyrille‌ Kone [INRIA]
Matheus Medeiros Centa [‌UNIV LILLE, until Mar 2025]
Nicolas‌ Michalak [INRIA]
Thomas Michel [INRIA‌]
Adrien Prevost [INRIA]
Waris Radji‌ [INRIA]
Adrienne Tuynman [INRIA]‌
Sumit Vashishtha [UNIV LILLE]
Redouane Yagouti‌ [INRIA, from Nov 2025]

Technical‌ Staff

Alex Davey [INRIA, Engineer,‌ until Sep 2025]
Guillaume Pourcel [INRIA‌, Engineer, until Apr 2025]
Julien‌ Teigny [INRIA, Engineer]

Interns and‌ Apprentices

Francesca Hemetsberger [INRIA, Intern,‌ from Sep 2025]
Aurele Mingam [INRIA‌, Intern, from Apr 2025 until Aug‌ 2025]
Francois Muller [INRIA, Intern‌, from Apr 2025 until Sep 2025]‌
Balthazar Tack [CENTRALE LILLE, Intern,‌ from Apr 2025 until Aug 2025]
Redouane‌ Yagouti [INRIA, Intern, from May‌ 2025 until Oct 2025]

Administrative Assistant

Amélie‌ Supervielle [INRIA]

2 Overall objectives

Scool‌ is a machine learning (ML) research group. Scool's‌ research focuses on the study of the sequential‌ decision making under uncertainty problem (SDMUP). In particular, we consider bandit problems‌ 57 and the reinforcement‌ learning (RL) problem 56‌‌. In a simplified way, RL considers the‌ problem of learning an‌ optimal policy in a‌‌ Markov Decision Problem (MDP) 54; when the‌ set of states collapses‌ to a single state,‌‌ this is known as the bandit problem which‌ focuses on the exploration/exploitation‌ problem.

Bandit and RL‌‌ problems are interesting to study on their own;‌ both types of problems‌ share a number of‌‌ fundamental issues (convergence analysis, sample complexity, representation, safety,‌ etc.); both problems‌ have real life applications,‌‌ different though closely related; the fact that while‌ solving an RL problem,‌ one faces an exploration/exploitation‌‌ problem and has to solve a bandit problem‌ in each state connects‌ the two types of‌‌ problems very intimately.

In our work, we also‌ consider settings going beyond‌ the Markovian assumption, in‌‌ particular non-stationary settings, which represent a challenge common‌ to bandits and RL.‌ A distinctive aspect of‌‌ the SDMUP with regards to the rest of‌ the field of ML‌ is that the learning‌‌ problem takes place within a closed-loop interaction between‌ a learning agent and‌ its environment. This feedback‌‌ loop makes our field of research very different‌ from the two other‌ sub-fields of ML, supervised‌‌ and unsupervised learning, even when they are defined‌ in an incremental setting.‌ Hence, SDMUP combines ML‌‌ with control: the learner is not passive, the‌ learner acts on its‌ environment, and learns from‌‌ the consequences of these interactions; hence, the learner‌ can act in order‌ to obtain information from‌‌ the environment. Naturally, the optimal control community is‌ getting more and more‌ interested by RL (see‌‌ e.g. 55).

We wish to go on,‌ studying applied questions and‌ developing theory to come‌‌ up with sound approaches to the practical resolution‌ of SDMUP tasks, and‌ guide their resolution. Non-stationary‌‌ environments are a particularly interesting setting; we are‌ studying this setting and‌ developing new tools to‌‌ approach it in a sound way, in order‌ to have algorithms to‌ detect environment changes as‌‌ fast as possible, and as reliably as possible,‌ adapt to them, and‌ prove their behavior, in‌‌ terms of their performance, measured with the regret‌ for instance. We mostly‌ consider non parametric statistical‌‌ models, that is models in which the number‌ of parameters is not‌ fixed (a parameter may‌‌ be of any type: a scalar, a vector,‌ a function, etc.),‌ so that the model‌‌ can adapt along learning, and to its changing‌ environment; this also lets‌ the algorithm learn a‌‌ representation that fits its environment.

3 Research program‌

Our research is mostly‌ dealing with bandit problems,‌‌ and reinforcement learning problems. We investigate each thread‌ separately and also in‌ combination, since the management‌‌ of the exploration/exploitation trade-off is a major issue‌ in reinforcement learning.

On‌ bandit problems, we focus‌‌ on:

structured bandits
bandits for planning (in particular‌ for Monte Carlo Tree‌ Search (MCTS))
non stationary‌‌ bandits

Regarding reinforcement learning,‌ we focus on:

modeling issues, and dealing with‌ the discrepancy between the model and the task‌ to solve
learning and using the structure of‌ a Markov decision problem, and of the learned‌ policy
generalization in reinforcement learning
reinforcement learning in‌ non stationary environments

Beyond these objectives, we put‌ a particular emphasis on the study of non-stationary‌ environments. Another area of great concern is the‌ combination of symbolic methods with numerical methods, be‌ it to provide knowledge to the learning algorithm‌ to improve its learning curve, or to better‌ understand what the algorithm has learned and explain‌ its behavior, or to rely on causality rather‌ than on mere correlation.

We also put a‌ particular emphasis on real applications and how to‌ deal with their constraints: lack of a simulator,‌ difficulty to have a realistic model of the‌ problem, small amount of data, dealing with risks,‌ availability of expert knowledge on the task.

4‌ Application domains

Scool has 2 main topics of‌ application:

health
sustainable development

In each of these‌ two domains, we put forward the investigation and‌ the application of the idea of sequential decision‌ making under uncertainty. Though supervised and non supervised‌ learning have already been studied and applied extensively,‌ sequential decision making remains far less studied; bandits‌ have already been used in many applications of‌ e-commerce (e.g. for computational advertising and recommendation systems).‌ However, in applications where human beings may be‌ severely impacted, bandits and reinforcement learning have not‌ been studied much; moreover, these applications come along‌ with a scarcity of data, and the non‌ availability of a simulator, which prevents heavy computational‌ simulations to come up with safe automatic decision‌ making.

In 2022, in health, we investigated patient‌ follow-up with Prof. F. Pattou's research group (CHU‌ Lille, Inserm, Université de Lille) in project B4H.‌ This effort came along with investigating how we‌ may use medical data available locally at CHU‌ Lille, and also the national social security data.‌ We also investigated drug repurposing with Prof. A.‌ Delahaye-Duriez (Inserm, Université de Paris) in project Repos.‌ We also studied catheter control by way of‌ reinforcement learning with Inria Lille group Defrost, and‌ company Robocath (Rouen).

Regarding sustainable development, we have‌ a set of projects and collaborations regarding agriculture‌ and gardening. With Cirad and CGIAR, we investigate‌ how one may recommend agricultural practices to farmers‌ in developing countries. Through an associate team with‌ Bihar Agriculture University (India), we investigate data collection.‌ Inria exploratory action SR4SG concerns recommender systems at‌ the level of individual gardens.

There are two‌ important aspects that are amply shared by these‌ two application fields. First, we consider that data‌ collection is an active task: we do not‌ passively observe and record data, we design methods‌ and algorithms to search for useful data. This‌ idea is exploited in most of these works‌ oriented towards applications. Second, many of these projects‌ include a careful management of risks for human beings. We have to‌ take decisions taking care‌ of their consequences on‌‌ human beings, on eco-systems and life more generally.‌

5 Social and environmental‌ responsibility

Sustainable development is‌‌ a major field of research and application of‌ Scool. We investigate what‌ machine learning can bring‌‌ to sustainable development, identifiying challenges and obstacles, and‌ studying how to overcome‌ them.

Let us mention‌‌ here:

sustainable agriculture in developing countries;
sustainable gardening.‌

More details can be‌ found in Section 4‌‌.

6 Highlights of the year

The Scool‌ team was selected to‌ organize the 19th edition‌‌ of the European Workshop on Reinforcement Learning in‌ October 2026.

Debabrota Basu‌ got the "top reviewer"‌‌ recognition from NeurIPS and was awarded a complementary‌ registration.

7 New results‌

We organize our research‌‌ results in a set of categories. The main‌ categories are: bandits and‌ RL theory, bandits and‌‌ RL under real life constraints, and applications.

Participants:‌ Adrienne Tuynman, Alex‌ Davey, Anthony Kobanda‌‌, Ayoub Ajarra, Brahim Driss, Debabrota‌ Basu, Emilie Kaufmann‌, Guillaume Pourcel,‌‌ Hector Kohler, Mickael Basson, Odalric-Ambrym Maillard‌, Penanklihi Cyrille Kone‌, Philippe Preux,‌‌ Remy Degenne, Riadh Akrour, Sumit Vashishtha‌, Thomas Michel,‌ Udvas Das, Waris‌‌ Radji.

7.1 Bandits and RL theory

Best-Arm‌ Identification in Unimodal Bandits‌, 39

We study‌‌ the fixed-confidence best-arm identification problem in unimodal bandits,‌ in which the means‌ of the arms increase‌‌ with the index of the arm up to‌ their maximum, then decrease.‌ We derive two lower‌‌ bounds on the stopping time of any algorithm.‌ The instance-dependent lower bound‌ suggests that due to‌‌ the unimodal structure, only three arms contribute to‌ the leading confidence-dependent cost.‌ However, a worst-case lower‌‌ bound shows that a linear dependence on the‌ number of arms is‌ unavoidable in the confidence-independent‌‌ cost. We propose modifications of Track-and-Stop and a‌ Top Two algorithm that‌ leverage the unimodal structure.‌‌ Both versions of Track-and-Stop are asymptotically optimal for‌ one-parameter exponential families. The‌ Top Two algorithm is‌‌ asymptotically near-optimal for Gaussian distributions and we prove‌ a non-asymptotic guarantee matching‌ the worse-case lower bound.‌‌ The algorithms can be implemented efficiently and we‌ demonstrate their competitive empirical‌ performance.

The Batch Complexity‌‌ of Bandit Pure Exploration, 40

In a‌ fixed-confidence pure exploration problem‌ in stochastic multi-armed bandits,‌‌ an algorithm iteratively samples arms and should stop‌ as early as possible‌ and return the correct‌‌ answer to a query about the arms distributions.‌ We are interested in‌ batched methods, which change‌‌ their sampling behaviour only a few times, between‌ batches of observations. We‌ give an instance-dependent lower‌‌ bound on the number of batches used by‌ any sample efficient algorithm‌ for any pure exploration‌‌ task. We then give a general batched algorithm‌ and prove upper bounds‌ on its expected sample‌‌ complexity and batch complexity. We illustrate both lower‌ and upper bounds on‌ best-arm identification and thresholding‌‌ bandits.

Pareto Set Identification‌ With Posterior Sampling, 33

The problem of‌ identifying the best answer among a collection of‌ items having real-valued distribution is well-understood. Despite its‌ practical relevance for many applications, fewer works have‌ studied its extension when multiple and potentially conflicting‌ metrics are available to assess an item's quality.‌ Pareto set identification (PSI) aims to identify the‌ set of answers whose means are not uniformly‌ worse than another. This paper studies PSI in‌ the transductive linear setting with potentially correlated objectives.‌ Building on posterior sampling in both the stopping‌ and the sampling rules, we propose the PSIPS‌ algorithm that deals simultaneously with structure and correlation‌ without paying the computational cost of existing oraclebased‌ algorithms. Both from a frequentist and Bayesian perspective,‌ PSIPS is asymptotically optimal. We demonstrate its good‌ empirical performance in real-world and synthetic instances.

FraPPE:‌ Fast and Efficient Preference-based Pure Exploration, 30‌

Preference-based Pure Exploration (PrePEx) aims to identify with‌ a given confidence level the set of Pareto‌ optimal arms in a vector-valued (aka multi-objective) bandit,‌ where the reward vectors are ordered via a‌ (given) preference cone $𝒞$ . Though PrePEx and‌ its variants are well-studied, there does not exist‌ a computationally efficient algorithm that can optimally track‌ the existing lower bound for arbitrary preference cones.‌ We successfully fill this gap by efficiently solving‌ the minimisation and maximisation problems in the lower‌ bound. First, we derive three structural properties of‌ the lower bound that yield a computationally tractable‌ reduction of the minimisation problem. Then, we deploy‌ a Frank-Wolfe optimiser to accelerate the maximisation problem‌ in the lower bound. Together, these techniques solve‌ the maxmin optimisation problem in $𝒪 (K‌ L^{2})$ time for a bandit instance‌ with $K$ arms and $L$ dimensional reward, which‌ is a significant acceleration over the literature. We‌ further prove that our proposed PrePEx algorithm, FraPPE,‌ asymptotically achieves the optimal sample complexity. Finally, we‌ perform numerical experiments across synthetic and real datasets‌ demonstrating that FraPPE achieves the lowest sample complexities‌ to identify the exact Pareto set among the‌ existing algorithms.

Leveraging Priors on Distribution Functions for‌ Multi-Arm Bandits, 21

We introduce Dirichlet Process‌ Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for‌ multi-arm bandits based on Dirichlet Process (DP) priors.‌ Like Thompson-sampling, DPPS is a probability-matching algorithm, i.e.,‌ it plays an arm based on its posterior‌ probability of being optimal. Instead of assuming a‌ parametric class for the reward generating distribution of‌ each arm, and then putting a prior on‌ the parameters, in DPPS the reward generating distribution‌ is directly modeled using DP priors. DPPS provides‌ a principled approach to incorporate prior belief about‌ the bandit environment, and in the noninformative limit‌ of the DP priors (i.e. Bayesian Bootstrap), we‌ recover Non Parametric Thompson Sampling (NPTS), a popular‌ non-parametric bandit algorithm, as a special case of‌ DPPS. We employ stick-breaking representation of the DP‌ priors, and show excellent empirical performance of DPPS in challenging synthetic and‌ real world bandit environments.‌ Finally, using an information-theoretic‌‌ analysis, we show non-asymptotic optimality of DPPS in‌ the Bayesian regret setup.‌

Towards Blackwell Optimality: Bellman‌‌ Optimality Is All You Can Get, 48‌

Although average gain optimality‌ is a commonly adopted‌‌ performance measure in Markov Decision Processes (MDPs), it‌ is often too asymptotic.‌ Further incorporating measures of‌‌ immediate losses leads to the hierarchy of bias‌ optimalities, all the way‌ up to Blackwell optimality.‌‌ In this paper, we investigate the problem of‌ identifying policies of such‌ optimality orders. To that‌‌ end, for each order, we construct a learning‌ algorithm with vanishing probability‌ of error. Furthermore, we‌‌ characterize the class of MDPs for which identification‌ algorithms can stop in‌ finite time. That class‌‌ corresponds to the MDPs with a unique Bellman‌ optimal policy, and does‌ not depend on the‌‌ optimality order considered. Lastly, we provide a tractable‌ stopping rule that when‌ coupled to our learning‌‌ algorithm triggers in finite time whenever it is‌ possible to do so.‌

7.2 Bandits and RL‌‌ under Real-life constraints

Constrained Pareto Set Identification with‌ Bandit Feedback, 35‌

In this paper, we‌‌ address the problem of identifying the Pareto Set‌ under feasibility constraints in‌ a multivariate bandit setting.‌‌ Specifically, given a K-armed bandit with unknown means‌ in $ℝ^{d}$ ,‌ the goal is to‌‌ identify the set of arms whose mean is‌ not uniformly worse than‌ that of another arm‌‌ (i.e., not smaller for all objectives), while satisfying‌ some known set of‌ linear constraints, expressing, for‌‌ example, some minimal performance on each objective. Our‌ focus lies in fixed-confidence‌ identification, for which we‌‌ introduce an algorithm that significantly outperforms racing-like algorithms‌ and the intuitive two-stage‌ approach that first identifies‌‌ feasible arms and then their Pareto Set. We‌ further prove an information-theoretic‌ lower bound on the‌‌ sample complexity of any algorithm for constrained Pareto‌ Set identification, showing that‌ the sample complexity of‌‌ our approach is near-optimal. Our theoretical results are‌ supported by an extensive‌ empirical evaluation on a‌‌ series of benchmarks.

Bandit Pareto Set Identification in‌ a Multi-Output Linear Model‌, 34

We study‌‌ the Pareto Set Identification (PSI) problem in a‌ structured multi-output linear bandit‌ model. In this setting,‌‌ each arm is associated a feature vector belonging‌ to $ℝ^{h}$ ,‌ and its mean vector‌‌ in $ℝ^{d}$ linearly depends on this feature‌ vector through a common‌ unknown matrix $Θ \in‌‌ ℝ^{h \times d}$ . The goal is‌ to identify the set‌ of non-dominated arms by‌‌ adaptively collecting samples from the arms. We introduce‌ and analyze the first‌ optimal design-based algorithms for‌‌ PSI, providing nearly optimal guarantees in both the‌ fixed-budget and the fixed-confidence‌ settings. Notably, we show‌‌ that the difficulty of these tasks mainly depends‌ on the sub-optimality gaps‌ of h arms only.‌‌ Our theoretical results are supported by an extensive‌ benchmark on synthetic and‌ real-world datasets.

Kriging and‌‌ Gaussian Process Interpolation for‌ Georeferenced Data Augmentation, 51

Data augmentation is‌ a crucial step in the development of robust‌ supervised learning models, especially when dealing with limited‌ datasets. This study explores interpolation techniques for the‌ augmentation of geo-referenced data, with the aim of‌ predicting the presence of Commelina benghalensis L. in‌ sugarcane plots in La Réunion. Given the spatial‌ nature of the data and the high cost‌ of data collection, we evaluated two interpolation approaches:‌ Gaussian processes (GPs) with different kernels and kriging‌ with various variograms. The objectives of this work‌ are threefold: (i) to identify which interpolation methods‌ offer the best predictive performance for various regression‌ algorithms, (ii) to analyze the evolution of performance‌ as a function of the number of observations‌ added, and (iii) to assess the spatial consistency‌ of augmented datasets. The results show that GP-based‌ methods, in particular with combined kernels (GP-COMB), significantly‌ improve the performance of regression algorithms while requiring‌ less additional data. Although kriging shows slightly lower‌ performance, it is distinguished by a more homogeneous‌ spatial coverage, a potential advantage in certain contexts.‌

Optimal Regret of Bandits under Differential Privacy,‌ 25

As sequential learning algorithms are increasingly applied‌ to real life, ensuring data privacy while maintaining‌ their utilities emerges as a timely question. In‌ this context, regret minimisation in stochastic bandits under‌ $ϵ$ -global Differential Privacy (DP) has been widely‌ studied. Unlike bandits without DP, there is a‌ significant gap between the best-known regret lower and‌ upper bound in this setting, though they ”match”‌ in order. Thus, we revisit the regret lower‌ and upper bounds of $ϵ$ -global DP algorithms‌ for Bernoulli bandits and improve both. First, we‌ prove a tighter regret lower bound involving a‌ novel information-theoretic quantity characterising the hardness of $ϵ‌$ -global DP in stochastic bandits. Our lower bound‌ strictly improves on the existing ones across all‌ $ϵ$ values. Then, we choose two asymptotically optimal‌ bandit algorithms, i.e. DP-KLUCB and DP-IMED, and propose‌ their DP versions using a unified blueprint, i.e.,‌ (a) running in arm-dependent phases, and (b) adding‌ Laplace noise to achieve privacy. For Bernoulli bandits,‌ we analyse the regrets of these algorithms and‌ show that their regrets asymptotically match our lower‌ bound up to a constant arbitrary close to‌ 1. This refutes the conjecture that forgetting past‌ rewards is necessary to design optimal bandit algorithms‌ under global DP. At the core of our‌ algorithms lies a new concentration inequality for sums‌ of Bernoulli variables under Laplace mechanism, which is‌ a new DP version of the Chernoff bound.‌ This result is universally useful as the DP‌ literature commonly treats the concentrations of Laplace noise‌ and random variables separately, while we couple them‌ to yield a tighter bound.

FLIPHAT: Joint Differential‌ Privacy for High Dimensional Sparse Linear Bandits,‌ 28

High dimensional sparse linear bandits serve as‌ an efficient model for sequential decision-making problems (e.g.‌ personalized medicine), where high dimensional features (e.g. genomic data) on the users‌ are available, but only‌ a small subset of‌‌ them are relevant. Motivated by data privacy concerns‌ in these applications, we‌ study the joint differentially‌‌ private high dimensional sparse linear bandits, where both‌ rewards and contexts are‌ considered as private data.‌‌ First, to quantify the cost of privacy, we‌ derive a lower bound‌ on the regret achievable‌‌ in this setting. To further address the problem,‌ we design a computationally‌ efficient bandit algorithm, F‌‌orgetfuLIterative Private HArd‌ Thresholding (FLIPHAT). Along‌ with doubling of episodes‌‌ and episodic forgetting, FLIPHAT deploys a variant of‌ Noisy Iterative Hard Thresholding‌ (N-IHT) algorithm as a‌‌ sparse linear regression oracle to ensure both privacy‌ and regret-optimality. We show‌ that FLIPHAT achieves optimal‌‌ regret up to logarithmic factors. We analyze the‌ regret by providing a‌ novel refined analysis of‌‌ the estimation error of N-IHT, which is of‌ parallel interest.

Stochastic Online‌ Instrumental Variable Regression: Regrets‌‌ for Endogeneity and Bandit Feedback, 31

The‌ independence of noise and‌ covariates is a standard‌‌ assumption in online linear regression with unbounded noise‌ and linear bandit literature.‌ This assumption and the‌‌ following analysis are invalid in the case of‌ endogeneity, i.e., when the‌ noise and covariates are‌‌ correlated. In this paper, we study the online‌ setting of Instrumental Variable‌ (IV) regression, which is‌‌ widely used in economics to identify the underlying‌ model from an endogenous‌ dataset. Specifically, we upper‌‌ bound the identification and oracle regrets of the‌ popular Two-Stage Least Squares‌ (2SLS) approach to IV‌‌ regression but in the online setting. Our analysis‌ shows that Online 2SLS‌ (O2SLS) achieves $𝒪 (‌‌ d^{2} {log}^{2} T)$ identification and‌ $𝒪 (γ \sqrt{d‌ T log T})‌‌$ oracle regret after $T$ interactions, where $d$ is‌ the dimension of covariates‌ and $γ$ is the‌‌ bias due to endogeneity. Then, we leverage O2SLS‌ as an oracle to‌ design OFUL-IV, a linear‌‌ bandit algorithm. OFUL-IV can tackle endogeneity and achieves‌ $𝒪 (d \sqrt{T‌} log T)$ regret.‌‌ For different datasets with endogeneity, we experimentally show‌ efficiencies of O2SLS and‌ OFUL-IV.

Unifying (Federated) (Private)‌‌ High-Dimensional Bandits via ADMM, 45

We study‌ all possible variants of‌ the high dimensional stochastic‌‌ linear contextual bandit problem in federated and private‌ settings. We propose a‌ unifying algorithm design and‌‌ analysis framework built on ADMM. Our method achieves‌ existing state-of-the art guarantees‌ in either setting for‌‌ the central model. For the federated model, our‌ results are entirely new‌ and near-optimal in either‌‌ setting. We also establish a novel lower bound‌ on privacy-utility trade-off for‌ the federated model in‌‌ the private setting and demonstrate on suitable numerical‌ experiments for all problem‌ variants.

The Confusing Instance‌‌ Principle for Online Linear Quadratic Control, 19‌

We revisit the problem‌ of controlling linear systems‌‌ with quadratic costunder unknown dynamics with model-based reinforcement‌ learning. Traditional methods like‌ Optimism in the Face‌‌ of Uncertainty and Thompson‌ Sampling, rooted in multi-armed bandits (MABs), face practical‌ limitations. In contrast, we propose an alternative based‌ on the Confusing Instance (CI) principle, which underpins‌ regret lower bounds in MABsand discrete Markov Decision‌ Processes (MDPs) and is central to the Minimum‌ Empirical Divergence (MED) family of algorithms, known fortheir‌ asymptotic optimality in various settings. By leveraging the‌ structure of LQR policies along with sensitivity and‌ stability analysis, we develop MED-LQ. This novel control‌ strategy extends the principles of CI and MED‌ beyond small-scale settings. Our benchmarks on a comprehensive‌ control suite demonstrate that MED-LQ achieves competitive performance‌ in various scenarios while highlighting its potential for‌ broader applications in large-scale MDPs.

7.3 Bandits and‌ RL for real-life: Deep RL and Applications

Breiman‌ meets Bellman: Non-Greedy Decision Trees with MDPs,‌ 32

In supervised learning, decision trees are valued‌ for their inter-pretability and performance. While greedy decision‌ tree algorithms like CART remain widely used due‌ to their computational efficiency,they often produce sub-optimal solutions‌ with respect to a regularized training loss. Conversely,‌ optimal decision tree method scan find better solutions‌ but are computationally intensive and typically limited to‌ shallow trees or binary features. We present Dynamic‌ Programming Decision Trees (DPDT), a framework that bridges‌ the gap between greedy and optimal approaches. DPDT‌ relies on a Markov Decision Process formulation combined‌ with heuristic split generation to construct near-optimal decision‌ trees with significantly reduced computational complexity. Our approach‌ dynamically limits the set of admissible splits at‌ each node while directly optimizing the tree regularized‌ training loss. Theoretical analysis demonstrates that DPDT can‌ minimize regularized training losses at least as well‌ as CART. Our empirical study shows on multiple‌ datasets that DPDT achieves near-optimal loss with orders‌ of magnitude fewer operations than existing optimal solvers.‌ More importantly, extensive benchmarking suggests statistically significant improvements‌ of DPDT over both CART and optimal decision‌ trees in terms of generalization to unseen data.‌ We demonstrate DPDT practicality through applications to boosting,‌ where it consistently outperforms baselines. Our framework provides‌ a promising direction for developing efficient, near-optimal decision‌ tree algorithms that scale to practical applications.

How‌ Hard is it to Confuse a World Model?‌, 53

In reinforcement learning (RL) theory, the‌ concept of most confusing instances is central to‌ establishing regret lower bounds, that is, the minimal‌ exploration needed to solve a problem. Given a‌ reference model and its optimal policy, a most‌ confusing instance is the statistically closest alternative model‌ that makes a suboptimal policy optimal. While this‌ concept is well-studied in multi-armed bandits and ergodic‌ tabular Markov decision processes, constructing such instances remains‌ an open question in the general case. In‌ this paper, we formalize this problem for neural‌ network world models as a constrained optimization: finding‌ a modified model that is statistically close to‌ the reference one, while producing divergent performance between‌ optimal and suboptimal policies. We propose an adversarial‌ training procedure to solve this problem and conduct an empirical study across‌ world models of varying‌ quality. Our results suggest‌‌ that the degree of achievable confusion correlates with‌ uncertainty in the approximate‌ model, which may inform‌‌ theoretically-grounded exploration strategies for deep model-based RL.

Hierarchical‌ Subspaces of Policies for‌ Continual Offline Reinforcement Learning‌‌, 43

We consider a Continual Reinforcement Learning‌ setup, where a learning‌ agent must continuously adapt‌‌ to new tasks while retaining previously acquired skill‌ sets, with a focus‌ on the challenge of‌‌ avoiding forgetting past gathered knowledge and ensuring scalability‌ with the growing number‌ of tasks. Such issues‌‌ prevail in autonomous robotics and video game simulations,‌ notably for navigation tasks‌ prone to topological or‌‌ kinematic changes. To address these issues, we introduce‌ HiSPO, a novel hierarchical‌ framework designed specifically for‌‌ continual learning in navigation settings from offline data.‌ Our method leverages distinct‌ policy subspaces of neural‌‌ networks to enable flexible and efficient adaptation to‌ new tasks while preserving‌ existing knowledge. We demonstrate,‌‌ through a careful experimental study, the effectiveness of‌ our method in both‌ classical MuJoCo maze environments‌‌ and complex video game-like navigation simulations, showcasing competitive‌ performances and satisfying adaptability‌ with respect to classical‌‌ continual learning metrics, in particular regarding the memory‌ usage and efficiency.

A‌ Continual Offline Reinforcement Learning‌‌ Benchmark for Navigation Tasks, 42

Autonomous agents‌ operating in domains such‌ as robotics or video‌‌ game simulations must adapt to changing tasks without‌ forgetting about the previous‌ ones. This process called‌‌ Continual Reinforcement Learning poses non-trivial difficulties, from preventing‌ catastrophic forgetting to ensuring‌ the scalability of the‌‌ approaches considered. Building on recent advances, we introduce‌ a benchmark providing a‌ suite of video-game navigation‌‌ scenarios, thus filling a gap in the literature‌ and capturing key challenges‌ : catastrophic forgetting, task‌‌ adaptation, and memory efficiency. We define a set‌ of various tasks and‌ datasets, evaluation protocols, and‌‌ metrics to assess the performance of algorithms, including‌ state-of-the-art baselines. Our benchmark‌ is designed not only‌‌ to foster reproducible research and to accelerate progress‌ in continual reinforcement learning‌ for gaming, but also‌‌ to provide a reproducible framework for production pipelines‌ – helping practitioners to‌ identify and to apply‌‌ effective approaches.

StaQ it! Growing neural networks for‌ Policy Mirror Descent,‌ 44

In Reinforcement Learning‌‌ (RL), regularization has emerged as a popular tool‌ both in theory and‌ practice, typically based either‌‌ on an entropy bonus or a Kullback-Leibler divergence‌ that constrains successive policies.‌ In practice, these approaches‌‌ have been shown to improve exploration, robustness and‌ stability, giving rise to‌ popular Deep RL algorithms‌‌ such as SAC and TRPO. Policy Mirror Descent‌ (PMD) is a theoretical‌ framework that solves this‌‌ general regularized policy optimization problem, however the closed-form‌ solution involves the sum‌ of all past Q-functions,‌‌ which is intractable in practice. We propose and‌ analyze PMD-like algorithms that‌ only keep the last‌‌ $M$ Q-functions in memory, and show that for‌ finite and large enough‌ $M$ , a convergent‌‌ algorithm can be derived,‌ introducing no error in the policy update, unlike‌ prior deep RL PMD implementations. StaQ, the resulting‌ algorithm, enjoys strong theoretical guarantees and is competitive‌ with deep RL baselines, while exhibiting less performance‌ oscillation, paving the way for fully stable deep‌ RL algorithms and providing a testbed for experimentation‌ with Policy Mirror Descent.

PB $^{2}$ : Preference‌ Space Exploration via Population-Based Methods in Preference-Based Reinforcement‌ Learning, 41

Preference-based reinforcement learning (PbRL) has‌ emerged as a promising approach for learning behaviors‌ from human feedback without predefined reward functions. However,‌ current PbRL methods face a critical challenge in‌ effectively exploring the preference space, often converging prematurely‌ to suboptimal policies that satisfy only a narrow‌ subset of human preferences. In this work, we‌ identify and address this preference exploration problem through‌ population-based methods. We demonstrate that maintaining a diverse‌ population of agents enables more comprehensive exploration of‌ the preference landscape compared to single-agent approaches. Crucially,‌ this diversity improves reward model learning by generating‌ preference queries with clearly distinguishable behaviors, a key‌ factor in real-world scenarios where humans must easily‌ differentiate between options to provide meaningful feedback. Our‌ experiments reveal that current methods may fail by‌ getting stuck in local optima, requiring excessive feedback,‌ or degrading significantly when human evaluators make errors‌ on similar trajectories, a realistic scenario often overlooked‌ by methods relying on perfect oracle teachers. Our‌ population-based approach demonstrates robust performance when teachers mislabel‌ similar trajectory segments and shows significantly enhanced preference‌ exploration capabilities, particularly in environments with complex reward‌ landscapes

Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary‌ conditions & equivalence with Hamiltonian Echo Learning,‌ 52

Equilibrium Propagation (EP) is a learning algorithm‌ for training Energy-based Models (EBMs) on static inputs‌ which leverages the variational description of their fixed‌ points. Extending EP to time-varying inputs is a‌ challenging problem, as the variational description must apply‌ to the entire system trajectory rather than just‌ fixed points, and careful consideration of boundary conditions‌ becomes essential. In this work, we present Generalized‌ Lagrangian Equilibrium Propagation (GLEP), which extends the variational‌ formulation of EP to time-varying inputs. We demonstrate‌ that GLEP yields different learning algorithms depending on‌ the boundary conditions of the system, many of‌ which are impractical for implementation. We then show‌ that Hamiltonian Echo Learning (HEL) – which includes‌ the recently proposed Recurrent HEL (RHEL) and the‌ earlier known Hamiltonian Echo Backpropagation (HEB) algorithms –‌ can be derived as a special case of‌ GLEP. Notably, HEL is the only instance of‌ GLEP we found that inherits the properties that‌ make EP a desirable alternative to backpropagation for‌ hardware implementations: it operates in a ”forward-only” manner‌ (i.e. using the same system for both inference‌ and learning), it scales efficiently (requiring only two‌ or more passes through the system regardless of‌ model size), and enables local learning.

Studying Exploration‌ in RL: An Optimal Transport Analysis of Occupancy‌ Measure Trajectories, 18

The rising successes of RL are propelled by‌ combining smart algorithmic strategies‌ and deep architectures to‌‌ optimize the distribution of returns and visitations over‌ the state-action space. A‌ quantitative framework to compare‌‌ the learning processes of these eclectic RL algorithms‌ is currently absent but‌ desired in practice. We‌‌ address this gap by representing the learning process‌ of an RL algorithm‌ as a sequence of‌‌ policies generated during training, and then studying the‌ policy trajectory induced in‌ the manifold of state-action‌‌ occupancy measures. Using an optimal transport-based metric, we‌ measure the length of‌ the paths induced by‌‌ the policy sequence yielded by an RL algorithm‌ between an initial policy‌ and a final optimal‌‌ policy. Hence, we first define the Effort of‌ Sequential Learning (ESL). ESL‌ quantifies the relative distance‌‌ that an RL algorithm travels compared to the‌ shortest path from the‌ initial to the optimal‌‌ policy. Furthermore, we connect the dynamics of policies‌ in the occupancy measure‌ space and regret (another‌‌ metric to understand the suboptimality of an RL‌ algorithm), by defining the‌ Optimal Movement Ratio (OMR).‌‌ OMR assesses the fraction of movements in the‌ occupancy measure space that‌ effectively reduce an analogue‌‌ of regret. Finally, we derive approximation guarantees to‌ estimate ESL and OMR‌ with a finite number‌‌ of samples and without access to an optimal‌ policy. Through empirical analyses‌ across various environments and‌‌ algorithms, we demonstrate that ESL and OMR provide‌ insights into the exploration‌ processes of RL algorithms‌‌ and the hardness of different tasks in discrete‌ and continuous MDPs.

7.4‌ Others

Improving Diffusion Models‌‌ for the Traveling Salesman Problem (TSP) by Leveraging‌ the Structure of the‌ Solution Space, 26‌‌

In this paper we show how leveraging the‌ structure of the solution‌ space of the Traveling‌‌ Salesman Problem (TSP) can lead to a dramatic‌ improvement of the performance‌ of state of the‌‌ art diffusion based neural solvers. Building on recent‌ approaches of DIFUSCO and‌ T2TCO which pipeline a‌‌ diffusion-based solution generation with a local search procedure,‌ we propose IDEQ (constrained‌ Inverse Diffusion and EQuivalence‌‌ class-based training of diffusion models for combinatorial optimization).‌ IDEQ improves the quality‌ of the solutions by‌‌ leveraging the constrained structure of the TSP state‌ space. Indeed, the solution‌ space consists of locally‌‌ optimal Hamiltonian tours which is a much smaller‌ space than the space‌ of adjacency matrices used‌‌ in previous works. Also, the local search procedure‌ defines an equivalence class‌ of Hamiltonian tours: all‌‌ elements of this equivalence class reach the same‌ local optimum after the‌ application of the local‌‌ search. This should be aligned with the supervised‌ training objective of the‌ diffusion. IDEQ addresses these‌‌ two points. Our experiments show that IDEQ achieves‌ 0.3% to 0.4% optimality‌ gap on TSP instances‌‌ made of 500 cities, and 0.5% to 0.6%‌ optimality gap on TSP‌ instances with 1000 cities.‌‌ This sets a new SOTA for neural based‌ methods solving the TSP.‌ IDEQ also performs well‌‌ on the instances of‌ the TSPlib, a reference benchmark in the TSP‌ community, outside of the training distribution, with optimality‌ gaps ranging from 0.9 to 1.1 %.

Yara:‌ An Ocean Virtual Environment for Research and Development‌ of Autonomous Sailing Robots and Other Unmanned Surface‌ Vessels, 20

Overall, a big challenge in‌ building a sailboat USV relies on the development‌ of an autonomous system for guidance, navigation, and‌ control (GNC) because both sail and rudder angle‌ must be cooperatively adjusted to correct the navigation‌ direction -traditional propelled boats can be more easily‌ controlled with a straightforward control task to set‌ the rudder angle. Moreover, sailing upwind requires special‌ maneuvers to reach a given target in that‌ unfeasible direction. Reinforcement learning emerges as a promising‌ technique for building autonomous GNCs for sailing robots,‌ but training the neural network with a real‌ sailboat is impractical due to long periods of‌ training and safety reasons. Even traditional control-based approaches‌ are mainly tested in simulated environments due to‌ the difficulties in building and operating a real‌ sailboat. The issue that arises is the fidelity‌ of these simulated environments. In this context, we‌ propose Yara, an oceanic virtual environment with a‌ reliable physics simulation for developing, training, and evaluating‌ autonomous agents to operate digital twins of sailing‌ robots in reinforcement learning and other paradigms. An‌ autonomous sailing robot digital twin is available within‌ the virtual environment, with the foil dynamics constructed‌ based on a real sailing robot. We coupled‌ these foil dynamics in Gazebo's physics engine to‌ compute the lift and drag forces acting on‌ the sail, rudder, and keel. The simulated world‌ feeds sensors such as cameras, wind sensors, and‌ GPS. The Robot Operating System communicates these sensors'‌ data through topics, facilitating users' implementation and testing‌ of new GNC solutions. Yara provides a reliable‌ solution for foil dynamic simulated physics that achieves‌ a simulation speedup of 300 times on an‌ i7 laptop with 8 GB of RAM, powered‌ by a Nvidia RTX 3060 and running Ubuntu‌ 20.04. With this speedup, it is possible to‌ complete a million time steps of deep reinforcement‌ learning training in approximately eight hours. Evaluation scenarios‌ were presented to highlight specific features of the‌ simulator, like the maneuverability of the sailing robot‌ digital twin and applications to train, evaluate, and‌ compare reinforcement learning agents and other control solutions.‌

Efficient Active Imitation Learning with Random Network Distillation‌, 27

Developing agents for complex and underspecified‌ tasks, where no clear objective exists, remains challenging‌ but offers many opportunities. This is especially true‌ in video games, where simulated players (bots) need‌ to play realistically, and there is no clear‌ reward to evaluate them. While imitation learning has‌ shown promise in such domains, these methods often‌ fail when agents encounter out-of-distribution scenarios during deployment.‌ Expanding the training dataset is a common solution,‌ but it becomes impractical or costly when relying‌ on human demonstrations. This article addresses active imitation learning, aiming to trigger‌ expert intervention only when‌ necessary, reducing the need‌‌ for constant expert input along training. We introduce‌ Random Network Distillation DAgger‌ (RND-DAgger), a new active‌‌ imitation learning method that limits expert querying by‌ using a learned state-based‌ out-of-distribution measure to trigger‌‌ interventions. This approach avoids frequent expert-agent action comparisons,‌ thus making the expert‌ intervene only when it‌‌ is useful. We evaluate RND-DAgger against traditional imitation‌ learning and other active‌ approaches in 3D video‌‌ games (racing and third-person navigation) and in a‌ robotic locomotion task and‌ show that RND-DAgger surpasses‌‌ previous methods by reducing expert queries. Link

Exploring‌ Flow-Lenia Universes with a‌ Curiosity-driven AI Scientist: Discovering‌‌ Diverse Ecosystem Dynamics, 37

We present a‌ method for the automated‌ discovery of system-level dynamics‌‌ in Flow-Lenia-a continuous cellular automaton (CA) with mass‌ conservation and parameter localization-using‌ a curiosity-driven AI scientist.‌‌ This method aims to uncover processes leading to‌ self-organization of evolutionary and‌ ecosystemic dynamics in CAs.‌‌ We build on previous work which uses diversity‌ search algorithms in Lenia‌ to find self-organized individual‌‌ patterns, and extend it to large environments are‌ that support distinct interacting‌ patterns. We adapt Intrinsically‌‌ Motivated Goal Exploration Processes (IMGEPs) to drive exploration‌ of diverse Flow-Lenia environments‌ using simulation-wide metrics, such‌‌ as evolutionary activity, compression-based complexity, and multi-scale entropy.‌ We test our method‌ in two experiments, showcasing‌‌ its ability to illuminate significantly more diverse dynamics‌ compared to random search.‌ We show qualitative results‌‌ illustrating how ecosystemic simulations enable self-organization of complex‌ collective behaviors not captured‌ by previous individual pattern‌‌ search and analysis. We complement automated discovery with‌ an interactive exploration tool,‌ creating an effective human-AI‌‌ collaborative workflow for scientific investigation. Though demonstrated specifically‌ with Flow-Lenia, this methodology‌ provides a framework potentially‌‌ applicable to other parameterizable complex systems where understanding‌ emergent collective properties is‌ of interest.

7.4.1 Responsible‌‌ AI and Algorithmic Auditing

Active Fourier Auditor for‌ Estimating Distributional Properties of‌ ML Models, 22‌‌

With the pervasive deployment of Machine Learning (ML)‌ models in real-world applications,‌ verifying and auditing properties‌‌ of ML models have become a central concern.‌ In this work, we‌ focus on three properties:‌‌ robustness, individual fairness, and group fairness. We discuss‌ two approaches for auditing‌ ML model properties: estimation‌‌ with and without reconstruction of the target model‌ under audit. Though the‌ first approach is studied‌‌ in the literature, the second approach remains unexplored.‌ For this purpose, we‌ develop a new framework‌‌ that quantifies different properties in terms of the‌ Fourier coefficients of the‌ ML model under audit‌‌ but does not parametrically reconstruct it. We propose‌ the Active Fourier Auditor‌ (AFA), which queries sample‌‌ points according to the Fourier coefficients of the‌ ML model, and further‌ estimates the properties. We‌‌ derive high probability error bounds on AFA's estimates,‌ along with the worst-case‌ lower bounds on the‌‌ sample complexity to audit them. Numerically we demonstrate‌ on multiple datasets and‌ models that AFA is‌‌ more accurate and sample-efficient‌ to estimate the properties of interest than the‌ baselines.

When Witnesses Defend: A Witness Graph Topological‌ Layer for Adversarial Graph Learning, 23

Capitalizing‌ on the intuitive premise that shape characteristics are‌ more robust to perturbations, we bridge adversarial graph‌ learning with the emerging tools from computational topology,‌ namely, persistent homology representations of graphs. We introduce‌ the concept of witness complex to adversarial analysis‌ on graphs, which allows us to focus only‌ on the salient shape characteristics of graphs, yielded‌ by the subset of the most essential nodes‌ (i.e., landmarks), with minimal loss of topological information‌ on the whole graph. The remaining nodes are‌ then used as witnesses, governing which higher-order graph‌ substructures are incorporated into the learning process. Armed‌ with the witness mechanism, we design Witness Graph‌ Topological Layer (WGTL), which systematically integrates both local‌ and global topological graph feature representations, the impact‌ of which is, in turn, automatically controlled by‌ the robust regularized topological loss. Given the attacker's‌ budget, we derive the important stability guarantees of‌ both local and global topology encodings and the‌ associated robust topological loss. We illustrate the versatility‌ and efficiency of WGTL by its integration with‌ five GNNs and three existing non-topological defense mechanisms.‌ Our extensive experiments across six datasets demonstrate that‌ WGTL boosts the robustness of GNNs across a‌ range of perturbations and against a range of‌ adversarial attacks, leading to relative gains of up‌ to 18%.

Sublinear Algorithms for Wasserstein and Total‌ Variation Distances: Applications to Fairness and Privacy Auditing‌, 16

Resource-efficiently computing representations of probability distributions‌ and the distances between them while only having‌ access to the samples is a fundamental and‌ useful problem across mathematical sciences. In this paper,‌ we propose a generic algorithmic framework to estimate‌ the PDF and CDF of any sub-Gaussian distribution‌ while the samples from them arrive in a‌ stream. We compute mergeable summaries of distributions from‌ the stream of samples that require sublinear space‌ w.r.t. the number of observed samples. This allows‌ us to estimate Wasserstein and Total Variation (TV)‌ distances between any two sub-Gaussian distributions while samples‌ arrive in streams and from multiple sources (e.g.‌ federated learning). Our algorithms significantly improves on the‌ existing methods for distance estimation incurring super-linear time‌ and linear space complexities. In addition, we use‌ the proposed estimators of Wasserstein and TV distances‌ to audit the fairness and privacy of the‌ ML algorithms. We empirically demonstrate the efficiency of‌ the algorithms for estimating these distances and auditing‌ using both synthetic and real-world datasets.

The Fair‌ Game: Auditing & debiasing AI algorithms over time‌, 17

Abstract An emerging field of AI,‌ namely Fair Machine Learning (ML), aims to quantify‌ different types of bias (also known as unfairness)‌ exhibited in the predictions of ML algorithms, and‌ to design new algorithms to mitigate them. Often,‌ the definitions of bias used in the literature‌ are observational, i.e. they use the input and output of a pre-trained‌ algorithm to quantify a‌ bias under concern. In‌‌ reality, these definitions are often conflicting in nature‌ and can only be‌ deployed if either the‌‌ ground truth is known or only in retrospect‌ after deploying the algorithm.‌ Thus, there is a‌‌ gap between what we want Fair ML to‌ achieve and what it‌ does in a dynamic‌‌ social environment. Hence, we propose an alternative dynamic‌ mechanism, "Fair Game", to‌ assure fairness in the‌‌ predictions of an ML algorithm and to adapt‌ its predictions as the‌ society interacts with the‌‌ algorithm over time. "Fair Game" puts together an‌ Auditor and a Debiasing‌ algorithm in a loop‌‌ around an ML algorithm. The "Fair Game" puts‌ these two components in‌ a loop by leveraging‌‌ Reinforcement Learning (RL). RL algorithms interact with an‌ environment to take decisions,‌ which yields new observations‌‌ (also known as data/feedback) from the environment and‌ in turn, adapts future‌ decisions. RL is already‌‌ used in algorithms with pre-fixed long-term fairness goals.‌ "Fair Game" provides a‌ unique framework where the‌‌ fairness goals can be adapted over time by‌ only modifying the auditor‌ and the different biases‌‌ it quantifies. Thus, "Fair Game" aims to simulate‌ the evolution of ethical‌ and legal frameworks in‌‌ the society by creating an auditor which sends‌ feedback to a debiasing‌ algorithm deployed around an‌‌ ML system. This allows us to develop a‌ flexible and adaptive-over-time framework‌ to build Fair ML‌‌ systems pre- and post-deployment.

DP-SPRT: Differentially Private Sequential‌ Probability Ratio Tests,‌ 36

We revisit Wald's‌‌ celebrated Sequential Probability Ratio Test for sequential tests‌ of two simple hypotheses,‌ under privacy constraints. We‌‌ propose DP-SPRT, a wrapper that can be calibrated‌ to achieve desired error‌ probabilities and privacy constraints,‌‌ addressing a significant gap in previous work. DP-SPRT‌ relies on a private‌ mechanism that processes a‌‌ sequence of queries and stops after privately determining‌ when the query results‌ fall outside a predefined‌‌ interval. This OutsideInterval mechanism improves upon naive composition‌ of existing techniques like‌ AboveThreshold, potentially benefiting other‌‌ sequential algorithms. We prove generic upper bounds on‌ the error and sample‌ complexity of DP-SPRT that‌‌ can accommodate various noise distributions based on the‌ practitioner's privacy needs. We‌ exemplify them in two‌‌ settings: Laplace noise (pure Differential Privacy) and Gaussian‌ noise (Rényi differential privacy).‌ In the former setting,‌‌ by providing a lower bound on the sample‌ complexity of any $ϵ‌$ -DP test with prescribed‌‌ type I and type II errors, we show‌ that DP-SPRT is near‌ optimal when both errors‌‌ are small and the two hypotheses are close.‌ Moreover, we conduct an‌ experimental study revealing its‌‌ good practical performance.

Some Targets Are Harder to‌ Identify than Others: Quantifying‌ the Target-dependent Membership Leakage‌‌, 24

In a Membership Inference (MI) game,‌ an attacker tries to‌ infer whether a target‌‌ point was included or not in the input‌ of an algorithm. Existing‌ works show that some‌‌ target points are easier‌ to identify, while others are harder. This paper‌ explains the target-dependent hardness of membership attacks by‌ studying the powers of the optimal attacks in‌ a fixed-target MI game. We characterise the optimal‌ advantage and trade-off functions of attacks against the‌ empirical mean in terms of the Mahalanobis distance‌ between the target point and the data-generating distribution.‌ We further derive the impacts of two privacy‌ defences, i.e. adding Gaussian noise and sub-sampling, and‌ that of target misspecification on optimal attacks. As‌ by-products of our novel analysis of the Likelihood‌ Ratio (LR) test, we provide a new covariance‌ attack which generalises and improves the scalar product‌ attack. Also, we propose a new optimal canary-choosing‌ strategy for auditing privacy in the white-box federated‌ learning setting. Our experiments validate that the Mahalanobis‌ score explains the hardness of fixed-target MI games.‌

Dimension Agnostic Testing of Survey Data Credibility through‌ the Lens of Regression, 46

Assessing whether‌ a sample survey credibly represents the population is‌ a critical question for ensuring the validity of‌ downstream research. Generally, this problem reduces to estimating‌ the distance between two high-dimensional distributions, which typically‌ requires a number of samples that grows exponentially‌ with the dimension. However, depending on the model‌ used for data analysis, the conclusions drawn from‌ the data may remain consistent across different underlying‌ distributions. In this context, we propose a task-based‌ approach to assess the credibility of sampled surveys.‌ Specifically, we introduce a model-specific distance metric to‌ quantify this notion of credibility. We also design‌ an algorithm to verify the credibility of survey‌ data in the context of regression models. Notably,‌ the sample complexity of our algorithm is independent‌ of the data dimension. This efficiency stems from‌ the fact that the algorithm focuses on verifying‌ the credibility of the survey data rather than‌ reconstructing the underlying regression model. Furthermore, we show‌ that if one attempts to verify credibility by‌ reconstructing the regression model, the sample complexity scales‌ linearly with the dimensionality of the data. We‌ prove the theoretical correctness of our algorithm and‌ numerically demonstrate our algorithm's performance.

7.4.2 Formalization of‌ mathematics

Formalization of Brownian motion in Lean,‌ 49

Brownian motion is a building block in‌ modern probability theory. In this paper, we describe‌ a formalization of Brownian motion using the Lean‌ theorem prover. We build on the existing measure-theoretic‌ foundations in Lean's mathematical library, Mathlib, and we‌ develop several key components needed for the construction‌ of Brownian motion, including the Carathéodory and Kolmogorov‌ extension theorems, Gaussian measures in Banach spaces, and‌ the Kolmogorov-Chentsov theorem for path continuity.

Markov kernels‌ in Mathlib's probability library, 50

The probability‌ folder of Mathlib, Lean's mathematical library, makes a‌ heavy use of Markov kernels. We present their‌ definition and properties and describe the formalization of‌ the disintegration theorem for Markov kernels. That theorem‌ is used to define conditional probability distributions of‌ random variables as well as posterior distributions. We then explain how Markov‌ kernels are used in‌ a more unusual way‌‌ to get a common definition of independence and‌ conditional independence and, following‌ the same principles, to‌‌ define sub-Gaussian random variables. Finally, we also discuss‌ the role of kernels‌ in our formalization of‌‌ entropy and Kullback-Leibler divergence.

8 Bilateral contracts and‌ grants with industry

Participants:‌ Odalric-Ambrym Maillard, Philippe‌‌ Preux, Mickael Basson, Yann Berthelot,‌ Anthony Kobanda.

8.1‌ Bilateral contracts with industry‌‌

contract with Ubisoft, 2023–2026: PI: Odalric-Ambrym Maillard .‌

This contract is related‌ to Anthony Kobanda s‌‌ Ph.D. “Continual Reinforcement Learning with changing environments: Application‌ to Video Games”
contract‌ with Lilly Group, 2023–2026,‌‌ PI: Philippe Preux .

This contract is related‌ to Mickael Basson 's‌ Ph.D. “Reinforcement learning to‌‌ solve combinatorial optimization problems”.
contract with Saint-Gobain Research,‌ 2023–2026, PI: Philippe Preux‌ .

This contract is‌‌ related to Yann Berthelot 's Ph.D. “Reinforcement learning‌ for advanced control of‌ industrial processus”.

9 Partnerships‌‌ and cooperations

Participants: Philippe Preux, Odalric-Ambrym Maillard‌, Emilie Kaufmann,‌ Remy Degenne, Debabrota‌‌ Basu, Riadh Akrour, Timothee Mathieu,‌ Hector Kohler, Penanklihi‌ Cyrille Kone.

9.1‌‌ International initiatives

9.1.1 Inria associate team not involved‌ in an IIL or‌ an international program

SeRAI‌‌

Title:
Sequential Testing and Learning Algorithms for Verifiably‌ Robust and Responsible AI‌
Duration:
2025 -> 2027‌‌
Coordinator:
Arijit Ghosh (arijitiitkgpster@gmail.com)
Partners:
- Indian Statistical Institute‌ , Calcutta (Inde)
Inria‌ contact:
Debabrota Basu
Summary:‌‌
Artificial Intelligence (AI) and Machine Learning (ML) have‌ emerged as the technologies‌ of our times, and‌‌ presently, they are getting widely deployed in real-life‌ applications with socioeconomic consequences.‌ The reckoning of AI/ML‌‌ has motivated development of robust and responsible AI‌ algorithms to ensure social‌ alignment of them, and‌‌ also auditing algorithms to verify different properties of‌ the AI/ML systems before‌ and after deployment. Instead‌‌ of the plethora of existing robust and responsible‌ AI/ML algorithms, recent research‌ has exposed a gap‌‌ between what existing algorithms achieve in terms of‌ a socioeconomic indicator of‌ interest and what we‌‌ socially want them to achieve. Internationally, it has‌ led to a surge‌ in verifying and auditing‌‌ different properties of the ML and AI algorithms,‌ where EU AI regulations‌ push the frontiers. Distribution‌‌ and property testing play a pivotal role in‌ ensuring ethical and transparent‌ AI operations, aligning with‌‌ established guidelines and societal norms. These tests facilitate‌ the scrutiny of data‌ distributions used in AI‌‌ model training, critical for upholding ethical and fairness‌ standards. They further allow‌ verification of biases and‌‌ other unintended behaviours that may lead to adverse‌ consequences. Presently, there are‌ two complementary approaches to‌‌ these testing problems: (a) statistical learning based methods,‌ like using online learning,‌ sequential tests, active learning‌‌ etc., and (b) complexity theory-based and formal methods,‌ like query and memory‌ complexities of testing Boolean‌‌ functions and properties of graphs, and distributions corresponding‌ to their input-output domains.‌ While Scool team is‌‌ known for their expertise‌ in the statistics-based approach, the Indian Statistical Institute‌ team is known for their expertise in the‌ other one. In SeRAI, our scientific aim is‌ to marry these two complementary approaches (statistics and‌ formal methods) to AI and harness their strengths‌ to create verifiable and responsible but also efficient‌ AI/ML algorithms. Specifically, we aim to explore three‌ research directions. 1. Sample-efficient auditing bias of real-life‌ survey data from the lens of different learning‌ models with and without privacy constraints. 2. Statistically‌ and computationally efficient auditing of different statistical assumptions‌ on structured distributions required to design theoretically provable‌ ML and RL algorithms. 3. Statistically and computationally‌ efficient auditing of properties of large-scale ML models‌ with high-dimensional data under structures, like symmetry, sparsity,‌ and permutation-invariance. Thanks to the complementary expertise of‌ the partners, the SeRAI associate team aims to‌ contribute to the aforementioned three problems and design‌ statistically and computationally efficient distribution and property testing‌ algorithms under sequential interactions, which are frugal and‌ practical in the pursuit of a verifiable and‌ responsible AI.

9.2 International research visitors

9.2.1 Visits‌ of international scientists

Other international visits to the‌ team

Ahana Deb

Status
PhD
Institution of origin:‌
University of Pompeu Fabra
Country:
Spain
Dates:
April‌ to July 2025
Context of the visit:
Many‌ real world AI applications involve the use of‌ Reinforcement Learning (RL), such as medical trials, drug‌ design, recommendation systems, automated robots, self-driving cars, etc.‌ However most of the RL problems are modelled‌ on a Markovian assumption which does not account‌ for any dependency on the history of the‌ agent-environment interaction. But such dependencies are often found‌ in many real life scenarios, e.g., patient histories‌ (including symptoms, test results, previous treatments, etc) always‌ play a crucial role in in diagnosis and‌ consequent treatment. During her visit, we studied the‌ challanges and solutions of designing a computationally-statistically efficient‌ algorithm for these problems and deriving the corresponsing‌ convergence analysis.
Mobility program/type of mobility:
Research visit‌ funded by ELIAS PhD mobility grant.

Pratik Karmakar‌

Status
PhD
Institution of origin:
National University of‌ Singapore
Country:
Singapore
Dates:
April 2025
Context of‌ the visit:
Pratik presented his work with Pierre‌ Senellart (ENS, Paris) on ProvSQL: Provenance and Probabilistic‌ Querying in Uncertain Databases.
Mobility program/type of mobility:‌
Research visit.

9.2.2 Visits to international teams

Research‌ stays abroad

Hector Kohler

Visited institution:
University of‌ ALberta
Country:
Canada
Dates:
Jul-Aug.
Context of the‌ visit:
Hector visited the RLAI Lab, a highly‌ recognized lab in reinforcement learning.
Mobility program/type of‌ mobility:
research stay.

Cyrille Koné

Visited institution:
University‌ of Washington
Country:
United States
Dates:
June-August
Context‌ of the visit:
Cyrille visited the group of‌ Kevin Jamieson to work on best policy identification‌ in reinforcement learning.
Mobility program/type of mobility:
research‌ stay founded by UW.

9.3 National initiatives

9.3.1‌ ANR projects

Scool is involved in 5 ANR‌ projects:

ANR JCJC FATE, PI: Remy Degenne‌ , 2023–2027
ANR JCJC REPUBLIC, PI: Debabrota Basu , 2023–2026
ANR‌ BIP-UP, partnership: Scool/Inserm‌ (CHU de Lille), PI:‌‌ Adrien Prevost , 2023–2026.
ANR JCJC NeuRL, PI:‌ Riadh Akrour , 2024–2028‌
ANR JCJC STRESS, PI:‌‌ Timothee Mathieu , 2025–2029

9.3.2 PEPR projects

Scool‌ is involved in 2‌ PEPR:

PEPR AI: project‌‌ FOUNDRY, local head: Emilie Kaufmann (description below);
PEPR‌ « Agroécologie et numérique‌ », Pl@‌‌ntAgroEco, local head: Odalric-Ambrym Maillard

Title:
FOUNDRY‌
Duration:
July 2024 $\to‌$ June 2028
Coordinator:
Panayotis‌‌ Mertikopoulous, Polaris, Univ. Grenoble Alpes
Partners:
- POLARIS: a‌ joint research team between‌ the CNRS, Inria, and‌‌ Univ. Grenoble Alpes.
- ENS Lyon: faculty from the‌ pure and applied mathematics‌ department of ENS Lyon.‌‌
- Inria FAIRPLAY: a joint team between Criteo, IP‌ Paris (ENSAE and Ecole‌ Polytechnique), and Inria.
- LTCI:‌‌ the informations and communications laboratory of Télécom Paris.‌
- MILES: the machine intellligence‌ and learning systems of‌‌ the LAMSADE lab at Paris Dauphine.
- Inria Scool‌
Inria contact:
Emilie Kaufmann‌
Summary:
From automated hospital‌‌ admission systems powered by machine learning (ML), to‌ flexible chatbots capable of‌ fluent conversations and self-driving‌‌ cars, the wildfire spread of artificial intelligence (AI)‌ has brought to the‌ forefront a crucial question‌‌ with far-reaching ramifications for the society at large:‌ Can ML systems and‌ models be relied upon‌‌ to provide trustworthy output in high-stakes, mission- critical‌ environments? These questions invariably‌ revolve around the notion‌‌ of robustness, an operational desideratum that has‌ eluded the field since‌ its nascent stages. One‌‌ of the main reasons for this is the‌ fact that ML models‌ and systems are typically‌‌ data-hungry and highly sensitive to their training input,‌ so they tend to‌ be brittle, narrow-scoped, and‌‌ unable to adapt to situations that go beyond‌ their training envelope. On‌ that account, the core‌‌ vision of the proposed research is that robustness‌ cannot be achieved by‌ blindly throwing more data‌‌ and computing power to larger and larger models‌ with exponentially growing energy‌ requirements (and a commensurate‌‌ carbon footprint to boot). Instead, our proposal intends‌ to rethink and develop‌ the core theoretical and‌‌ methodological FOUNDations of Robustness and reliabilitY (FOUNDRY) that‌ are needed to build‌ and instill trust in‌‌ ML-powered technologies and systems from the ground up.‌

Title:
Pl@ntAgroEco
Duration:
July‌ 2024 $\to$ June 2028‌‌
Coordinator:
Alexis Joly, Inria Zenith, and Pierre Bonnet‌ CIRAD, AMAP.
Partners:
- INRAE‌
- INRIA
- IRD
- CIRAD
- Tela‌‌ Botanica
- Université de Montpellier
- Université Paris-Saclay
Inria contact:‌
Odalric-Ambrym Maillard
Summary:
Agroecology‌ necessarily involves crop diversification,‌‌ but also the early detection of diseases, deficiencies‌ and stresses (hydric, etc.),‌ as well as better‌‌ management of biodiversity. The main stumbling block is‌ that this paradigm shift‌ in agricultural practices requires‌‌ expert skills in botany, plant pathology and ecology‌ that are not generally‌ available to those working‌‌ in the field, such as farmers or agri-food‌ technicians. Digital technologies, and‌ artificial intelligence in particular,‌‌ can play a crucial role in overcoming this‌ barrier to access to‌ knowledge.

The aim of‌‌ the Pl@ntAgroEco‌ project will be to design, experiment with and‌ develop new high-impact agro-ecology services within the Pl@ntNet‌ platform. This includes :
- research in AI and‌ plant sciences ;
- agile development of new components‌ within the platform;
- organization of participatory science programs‌ and animation of the Pl@ntNet user‌ community.
Ce programme de travail a pour but‌ de produire une amélioration de la détection et‌ reconnaissance des maladies végétales, de l'identification des niveaux‌ infraspécifiques. Il permettra le développement d'outils d'estimation de‌ la sévérité des symptômes, carences, stades de déclin‌ et stress hydrique ou de caractérisation des associations‌ d'espèces à partir d'images multi-spécimens. Il améliorera la‌ connaissance des espèces.

Le projet Pl@ntAgroEco rassemble des‌ forces complémentaires en matière de recherche, de développement‌ et d'animation. S'ajouteront à l'équipe pluridisciplinaire chargée de‌ la plateforme Pl@ntNet de nouvelles forces de recherche‌ ayant une expertise reconnue dans les sciences participatives.‌ Le consortium rassemblera 10 partenaires incluant des organismes‌ de recherche, des universités, des acteurs de la‌ société civile et des partenaires internationaux.

9.3.3 Other‌ projects in France

Scool is involved in the‌ Regalia pilot-project.

Other collaborations:

L. Richert, R. Thiébaut,‌ Inria SISTM, Bordeaux, bandits for vaccine clinical trials.‌
W. M. Koolen, CWI Amsterdan & University of‌ Twente, concentration of information divergences.

9.3.4 Inria Exploratory‌ Actions

Emilie Kaufmann obtained an Action Exploratoire grant‌ BETA-3K (Bandit Exploration for Treatment Allocation in Phase‌ III Trials with $K > 2$ arms) in‌ 2025 to start a collaboration with the University‌ of Cambridge (MRC Biostatistics Unit, team of Sofìa‌ Villar) on adaptive clinical trials.

Effective start of‌ the Action Exploratoire AuDaCiTi (Autonomous Data Collection and‌ Labelling Through Interaction) with the hiring of Hadrien‌ Crassous as a PhD student in November 2025,‌ under the supervision of Riadh Akrour . AEx‌ in collaboration with the Robot Learning group of‌ Joni Pajarinen at Aalto University.

Remy Degenne obtained‌ an Action Exploratoire grant FORMAL (formal proofs for‌ machine learning).

10 Dissemination

Participants: Philippe Preux,‌ Odalric-Ambrym Maillard, Emilie Kaufmann, Remy Degenne‌, Debabrota Basu, Riadh Akrour, Timothee‌ Mathieu, Juliette Achddou, Julien Teigny,‌ Adrienne Tuynman.

10.1 Promoting scientific activities

10.1.1‌ Scientific events: organisation

Debabrota Basu co-organized with A.‌ Gilra of CWI Amsterdam a research semester program‌ on ¨Control Theory and Reinforcement Learning: Connections‌ and Challenges¨. It included
1. Spring School on‌ Control Theory and Reinforcement Learning, 17-21 March, 2025.‌
2. Workshop on Themes across Control and Reinforcement Learning,‌ 24-25 March, 2025.
3. Workshop on Modern Applications of‌ Control Theory and Reinforcement Learning, 20-21 May, 2025.‌
4. Workshop on Theory of Control and Reinforcement Learning,‌ 19-20 June, 2025.
Odalric-Ambrym Maillard : Organization of‌ “Séminaire Itinérant” of the AI transversal Axis of‌ the CRIStaL laboratory, together with M. Keller (Inria‌ Magnet)

10.1.2 Scientific events: selection

Member of the‌ conference program committees

Debabrota Basu : member of‌ the PC at AAAI, PETS, CCS.
Remy Degenne‌ : member of the PC at ALT.
Emilie Kaufmann : member of‌ the PC at COLT.‌
Philippe Preux : member‌‌ of the PC at IJCAI and ECML.

Reviewer‌

Juliette Achddou : reviewer‌ at NeurIPS
Riadh Akrour‌‌ : reviewer at ICML, NeurIPS, ICLR
Debabrota Basu‌ : reviewer at ICML,‌ NeurIPS, AISTATS, AAAI, EWRL,‌‌ AAMAS.
Remy Degenne : reviewer at COLT, ICML,‌ ALT
Emilie Kaufmann :‌ reviewer at AISTATS, NeurIPS.‌‌
Odalric-Ambrym Maillard : reviewer at RLC, ACML, EWRL.‌
Timothee Mathieu : reviewer‌ at COLT, AISTATS and‌‌ EWRL

10.1.3 Journal

Reviewer - reviewing activities

Debabrota‌ Basu : reviewer for‌ JMLR, TMLR, IEEE TPAMI,‌‌ IEEE Access, ACM Journal on Responsible Computing, IEEE‌ TAI, IEEE TKDE, IEEE‌ Information Theory, Communications in‌‌ Statistics – Theory and Methods.
Remy Degenne :‌ reviewer for JMLR, Annals‌ of Formalized Mathematics.
Emilie‌‌ Kaufmann : reviewer for JMLR.
Odalric-Ambrym Maillard :‌ reviewer for JMLR, Mathematics‌ of Operation Research.
Timothee‌‌ Mathieu : reviewer for Biometrika, Annals of Statistics,‌ ALEA, JMLR, JRSSB.

10.1.4‌ Invited talks

Debabrota Basu‌‌ : Exploration–Exploitation Dilemma in RL: Bridging Theory-to-Practice Gaps,‌ Centre for AI, IIIT‌ Delhi, September 2025.
Debabrota‌‌ Basu : When Privacy meets Partial Information: Privacy-Utility‌ Trade-offs in Bandits, CNI‌ Seminars, IISc Bangalore, September‌‌ 2025.
Debabrota Basu : Actors & Critics: Function‌ Approximation & Policy Gradients‌ in RL, Spring School‌‌ on Control Theory & RL, CWI Amsterdam, March‌ 2025.
Debabrota Basu :‌ Exploration–Exploitation in RL: Calibrated‌‌ Optimism in Face of Uncertainty, Spring School on‌ Control Theory & RL,‌ Amsterdam, March 2025.
Remy‌‌ Degenne : Lean for PDEs workshop, Simons Laufer‌ Mathematical Institute, Berkeley, California,‌ October 2025.
Remy Degenne‌‌ : Stochastic Analysis and Mathematical Finance seminar, Oxford,‌ November 2025.
Remy Degenne‌ : ItaLean conference on‌‌ Bridging Formal Mathematics and AI, Bologna, December 2025.‌
Remy Degenne : Imperial‌ Lean study group, London,‌‌ December 2025.
Emilie Kaufmann : Colloquium Polaris, Lille,‌ February 2025.
Emilie Kaufmann‌ : ENSAE Statistics seminar,‌‌ Saclay, May 2025.
Emilie Kaufmann : Colloquium of‌ the MAP 5, Paris,‌ June 2025.
Emilie Kaufmann‌‌ : RL Theory Workshop, CWI, Amsterdam, June 2025.‌
Emilie Kaufmann : Workshop‌ on Regret, Optimization and‌‌ Games, IHP, Paris, November 2025.
Emilie Kaufmann :‌ Symposium de l’Association d’Informatique‌ Medicale (AIM), Lille, November‌‌ 2025.
Emilie Kaufmann : Algorithmic Statistics Workshop, Oxford,‌ November 2025.
Odalric-Ambrym Maillard‌ : talk at Colloque‌‌ L'Agriculture au prisme des data-sciences organized by Alliance‌ Harvest, regarding Pl@ntAgroEco project‌ and collaboration in Agroecology,‌‌ Palaiseau, February 2025.
Odalric-Ambrym Maillard : talk at‌ Inria Breizh-Carnot festival, Rennes,‌ November 2025.

10.1.5 Scientific‌‌ expertise

Odalric-Ambrym Maillard : evaluation of an ANR‌ JCJC project.
Philippe Preux‌ :
- evaluation of 2‌‌ ANR project proposals.
- member of the scientific committee‌ of the MathNum department‌ at Inrae
- member of‌‌ the scientific committee of PEPR agroécologie et numérique‌
- member of the scientific‌ and ethical committee of‌‌ INCLUDE (data warehouse of CHU Lille).
- member of‌ the éthique en commun‌ joint committee of Inrae,‌‌ IRD, Ifremer, and Cirad.
Adrienne Tuynman : evaluation‌ of bachelor programs of‌ the University of Science‌‌ and Technology of Hanoi,‌ Vietnam, as part of a committee of the‌ international section of HCERES

10.1.6 Research administration

Odalric-Ambrym‌ Maillard
- Scientific coordinator of AI transversal Axis of‌ CRIStaL laboratory, together with M. Keller (Inria Magnet)‌
Philippe Preux :
- scientific coordinator of CPER CornelIA‌
- member of the DAS1 health at Région‌ Hauts-de-France
- member of the “Hub Santé”' board of‌ the “Intiative d'excellence”'
- member of the BCEP at‌ Inria Lille.

10.2 Teaching - Supervision - Juries‌ - Educational and pedagogical outreach

Juliette Achddou :‌ “Introduction au Deep Learning”, M2 Informatique et Statistiques,‌ Polytech' Lille.
Juliette Achddou : “Machine Learning”, M1‌ Informatique et Statistiques, Polytech' Lille.
Juliette Achddou :‌ “Algorithmique numérique pour l'optimisation”, M1 Informatique et Statistiques,‌ Polytech' Lille.
Juliette Achddou : “Classification supervisée”, L3‌ Informatique et Statistiques, Polytech' Lille.
Juliette Achddou :‌ “Régression Linéaire”, L3 Informatique et Statistiques, Polytech' Lille.‌
Juliette Achddou : “Data Mining”, L3 Informatique et‌ Statistiques, Polytech' Lille.
Juliette Achddou : “Algèbre Linéaire‌ Numérique”, L3 Informatique et Statistiques, Polytech' Lille.
Juliette‌ Achddou : “Introduction aux Environnements Virtuels”, M1 Informatique‌ et Statistiques, Polytech' Lille.
Riadh Akrour : “Option‌ Machine Learning”, L3 Informatique, Université de Lille.
Riadh‌ Akrour : “Sequential Decision Making”, M2 Data Science,‌ Ecole Centrale de Lille.
Debabrota Basu : “Sequential‌ Decision Making”, M2 in Data Science, Centrale Lille‌ and Université de Lille.
Debabrota Basu : “Research‌ Reading Group”, M2 in Data Science, Centrale Lille‌ and Université de Lille.
Debabrota Basu : “Advanced‌ Machine Learning and Decision Making”, Centrale Lille
Remy‌ Degenne : “Sequential Learning”, Master MVA, ENS Paris-Saclay.‌
Remy Degenne : “Sequential Learning”, Centrale Lille.
Emilie‌ Kaufmann : “Statistics 2”, M1 Data Science, Ecole‌ Centrale de Lille.
Odalric-Ambrym Maillard : “Reinforcement Learning‌ Research Challenges”, Executive Master Ecole Polytechnique.
Odalric-Ambrym Maillard‌ : “Reinforcement Learning for the Industry”, Inria Academy.‌
Philippe Preux : “Prise de décision séquentielle dans‌ l'incertain”, M2 in Computer Science, Université de Lille.‌
Philippe Preux : “Apprentissage par renforcement”, M2 in‌ Computer Science, Université de Lille.
Philippe Preux :‌ “Science des données II”, L3 MIASHS, Université de‌ Lille.
Philippe Preux : “Science des données III”,‌ L3 MIASHS, Université de Lille.
Philippe Preux :‌ “Réseaux de neurones”, L1 Maths-Informatique, Université de Lille.‌
Philippe Preux : “IA et apprentissage automatique”, DU‌ IA & Santé, Université de Lille.
Adrienne Tuynman‌ : “Cryptographie” (practical sessions), M1 in Applied Mathematics,‌ University of Lille

10.2.1 Supervision

Riadh Akrour :‌
- Ph.D. students: Brahim Driss , Hadrien Crassous
- M2‌ Research Internship: Francois Muller
Debabrota Basu and Emilie‌ Kaufmann : Ph.D. student: Thomas Michel
Debabrota Basu‌ and Odalric-Ambrym Maillard : Ph.D. student: Udvas Das‌
Remy Degenne and Emilie Kaufmann : Ph.D. students:‌ Redouane Yagouti , Adrienne Tuynman
Emilie Kaufmann :‌ Ph.D. student: Penanklihi Cyrille Kone
Timothee Mathieu and‌ Odalric-Ambrym Maillard : Ph.D. student: Adrien Prevost
Odalric-Ambrym‌ Maillard : Ph.D. students: Sumit Vashishtha , Anthony‌ Kobanda , Waris Radji .
Philippe Preux :‌
- Ph.D. students: Matheus Medeiros Centa , Mickael Basson‌
Philippe Preux and Riadh Akrour : Ph.D. students: Hector Kohler , Yann‌ Berthelot
Philippe Preux and‌ Emilie Kaufmann : Ph.D.‌‌ student: Thomas Michel
Philippe Preux and Debabrota Basu‌ : Ph.D. students: Ayoub‌ Ajarra

10.2.2 Juries

Riadh‌‌ Akrour : Ph.D. defenses: Hector Kohler (supervisor)
Debabrota‌ Basu :
- Ph.D. defense:‌ Y. Wang (Inria Lille‌‌ and Orange)
- CS: G. Richardeau (Inria Rennes, Univ.‌ Rennes)
Emilie Kaufmann :‌
- Ph.D. defenses: A. Rio‌‌ (Université de Grenoble, reviewer), D. Tiapkin (CMAP, Ecole‌ Polytechnique, reviewer), C. Fiegel‌ (ENSAE), R. Zhang (LSS,‌‌ CentraleSupélec, reviewer), A. Gouverneur (KTH).
Odalric-Ambrym Maillard :‌
- HdR defense: R. Combes‌ (Centrale-Supélec, Palaiseau, rapporteur),
- Ph.D.‌‌ defenses: F. Morri (Inria, U. Lille, examinateur), O.‌ Rossini (IMAG, U. Montpellier,‌ examinateur), S. Lindstahl (KTH,‌‌ Stockholm, opponent), G.J. Molina (UPF, Barcelona, examinateur), F.‌ Fabre (U. Reunion, co-superviseur).‌
- CSI: Q.L Ta. (Intitut‌‌ Polytechnique de Paris, rapporteur).
- CR hiring committee of‌ the Mathnum department at‌ Inrae (23,24,25,26).
Philippe Preux‌‌ :
- Participation to a hiring committee for an‌ associate professor position in‌ computer science applied to‌‌ humanities at Paris-Sorbonne.
- Ph.D. defenses: Th. Firmin (Université‌ de Lille), P-A. Le‌ Tolguenec (Toulouse, reviewer), M.‌‌ Zouitine (Toulouse), Hector Kohler (supervisor), F. Fabre-Ferber (La‌ Réunion, reviewer)
- HdR defense:‌ Riadh Akrour .

10.2.3‌‌ Educational and pedagogical outreach

E. Kaufmann participated to‌ a table ronde "Femmes‌ et mathématiques" organized on‌‌ Pi Day (03/14) at LILLIAD and targetting undergrad‌ students at the University‌ of Lille

10.3 Popularization‌‌

10.3.1 Productions (articles, videos, podcasts, serious games, ...)‌

Timothee Mathieu : Video‌ for the Summer of‌‌ Maths Exposition on Sequential testing (link to‌ the video).
Emilie‌ Kaufmann participated to the‌‌ book “Tout comprendre (ou presque) sur l'intelligence artificielle”‌ by Olivier Cappé and‌ Claire Marc at CNRS‌‌ Editions (chapter 12).
Debabrota Basu drafted a book‌ chapter in the “‌AI Bias in Education‌‌” book, which got highlighted in Forbes’ 2025‌ reading list on AI.‌

10.3.2 Participation in Live‌‌ events

Emilie Kaufmann gave a presentation “Tout Comprendre‌ (ou presque) sur l'Intelligence‌ artificielle” and organized a‌‌ “table ronde” on the same topic for high-school‌ students and maths teacher‌ at LILLIAD during the‌‌ NSI (Numérique Science Informatique week) week (link‌). T. Michel (PhD‌ student) participated to the‌‌ table ronde.
Philippe Preux gave a presentation on‌ “AI, for the best,‌ and the worst” at‌‌ the “Fête de la science”
Julien Teigny :‌
- 1 hour presentation on‌ the “alignment of AIs”‌‌ to:
  - a group of ENS Rennes students visiting‌ Lille
  - teachers of the‌ “académie de Lille”
- presentation‌‌ of the bariatric surgery website to a group‌ of ENS Rennes students,‌
- presentation on AI using‌‌ a game to “collégiens 3è”,
- co-supervision of a‌ serious game jam on‌ “Equité en Intelligence Artificielle”‌‌ for people of INSPE (link to the‌ video),
Odalric-Ambrym Maillard‌ : Interview for France‌‌ Culture "8h45", regarding AI and agroecology, February 26‌ 2025.

10.3.3 Others science‌ outreach relevant activities

Philippe‌‌ Preux was interviewed by senator A. Basquin on‌ AI.

11 Scientific production‌

11.1 Major publications

1‌‌ inproceedingsB.Borja Balle‌ and O.-A.Odalric-Ambrym Maillard. Spectral Learning from‌ a Single Trajectory under Finite-State Policies.International‌ conference on Machine LearningProceedings of the International‌ conference on Machine LearningSidney, FranceJuly 2017‌HAL
2 inproceedingsD.Dorian Baudry, R.‌Romain Gautron, E.Emilie Kaufmann and O.-A.‌Odalric-Ambrym Maillard. Optimal Thompson Sampling strategies for‌ support-aware CVaR bandits.38th International Conference on‌ Machine Learningproceedings of machine learning researchVirtual,‌ United StatesJuly 2021HAL
3 inproceedingsL.‌Lilian Besson and E.Emilie Kaufmann. Multi-Player‌ Bandits Revisited.Algorithmic Learning TheoryMehryar Mohri‌ and Karthik SridharanLanzarote, SpainApril 2018HAL‌
4 articleG.Gabriel Dulac-Arnold, L.Ludovic‌ Denoyer, P.Philippe Preux and P.Patrick‌ Gallinari. Sequential approaches for learning datum-wise sparse‌ representations.Machine Learning891-2October 2012‌, 87-122HAL DOI
5 inproceedingsY.Yannis‌ Flet-Berliac and P.Philippe Preux. Only Relevant‌ Information Matters: Filtering Out Noisy Samples to Boost‌ RL.IJCAI 2020 - International Joint Conference‌ on Artificial IntelligenceYokohama, JapanJuly 2020HAL‌DOI
6 inproceedingsA.Aurélien Garivier and E.‌Emilie Kaufmann. Optimal Best Arm Identification with‌ Fixed Confidence.29th Annual Conference on Learning‌ Theory (COLT)49JMLR Workshop and Conference Proceedings‌New York, United StatesJune 2016HAL
7‌ inproceedingsB.Bishwamittra Ghosh, D.Debabrota Basu‌ and K. S.Kuldeep S. Meel. Justicia:‌ A Stochastic SAT Approach to Formally Verify Fairness‌.Proceedings of the AAAI Conference on Artificial‌ IntelligenceAAAI Conference on Artificial Intelligence35Proceedings‌ of the AAAI Conference on Artificial Intelligence9‌Virtual, CanadaFebruary 2021, 7554-7563HAL
8‌ inproceedingsM.Marc Jourdan, R.Rémy Degenne‌, D.Dorian Baudry, R.Rianne de‌ Heide and E.Emilie Kaufmann. Top Two‌ Algorithms Revisited.NeurIPS 2022 - 36th Conference‌ on Neural Information Processing SystemAdvances in Neural‌ Information Processing SystemsNew Orleans, United StatesNovember‌ 2022HAL
9 articleH.Hachem Kadri,‌ E.Emmanuel Duflos, P.Philippe Preux,‌ S.Stéphane Canu, A.Alain Rakotomamonjy and‌ J.Julien Audiffren. Operator-valued Kernels for Learning‌ from Functional Response Data.Journal of Machine‌ Learning Research17202016, 1-54HAL‌
10 inproceedingsE.Emilie Kaufmann and W. M.‌Wouter M. Koolen. Monte-Carlo Tree Search by‌ Best Arm Identification.NIPS 2017 - 31st‌ Annual Conference on Neural Information Processing SystemsAdvances‌ in Neural Information Processing SystemsLong Beach, United‌ StatesDecember 2017, 1-23HAL
11 inproceedings‌E.Emilie Kaufmann, P.Pierre Ménard,‌ O.Omar Darwiche Domingues, A.Anders Jonsson‌, E.Edouard Leurent and M.Michal Valko‌. Adaptive reward-free exploration.Algorithmic Learning Theory‌Paris, France2021HAL
12 articleO.-A.Odalric-Ambrym‌ Maillard. Boundary Crossing Probabilities for General Exponential‌ Families.Mathematical Methods of Statistics272018‌HAL
13 inproceedingsO.-A.Odalric-Ambrym Maillard, H.Hippolyte Bourel and M.‌ S.Mohammad Sadegh Talebi‌. Tightening Exploration in‌‌ Upper Confidence Reinforcement Learning.International Conference on‌ Machine LearningVienna, Austria‌July 2020HAL
14‌‌ articleT.Timothée Mathieu, R.Riccardo Della‌ Vecchia, A.Alena‌ Shilova, M.Matheus‌‌ Medeiros Centa, H.Hector Kohler, O.-A.‌Odalric-Ambrym Maillard and P.‌Philippe Preux. AdaStop:‌‌ adaptive statistical testing for sound comparisons of Deep‌ RL agents.Transactions‌ on Machine Learning Research‌‌ Journal2024HAL
15 inproceedingsF.Fabien Pesquerel‌ and O.-A.Odalric-Ambrym Maillard‌. IMED-RL: Regret optimal‌‌ learning of ergodic Markov decision processes.NeurIPS‌ 2022 - Thirty-sixth Conference‌ on Neural Information Processing‌‌ SystemsThirty-sixth Conference on Neural Information Processing Systems‌New-Orleans, United StatesNovember‌ 2022HAL

11.2 Publications‌‌ of the year

International journals

16 articleD.‌Debabrota Basu and D.‌Debarshi Chanda. Sublinear‌‌ Algorithms for Estimating Wasserstein and Total Variation Distances:‌ Applications to Fairness and‌ Privacy Auditing.Transactions‌‌ on Machine Learning Research JournalJanuary 2026HAL‌back to text
17‌ articleD.Debabrota Basu‌‌ and U.Udvas Das. The Fair Game:‌ Auditing & debiasing AI‌ algorithms over time.‌‌Cambridge Forum on AI: Law and Governance1‌June 2025, 27‌HAL DOI back to‌‌ text
18 articleR. M.Reabetswe M. Nkhumise‌, D.Debabrota Basu‌, T. J.Tony‌‌ J. Prescott and A.Aditya Gilra. Studying‌ Exploration in RL: An‌ Optimal Transport Analysis of‌‌ Occupancy Measure Trajectories.Transactions on Machine Learning‌ Research JournalMay 2025‌HAL back to text‌‌
19 articleW.Waris Radji and O.-A.Odalric-Ambrym‌ Maillard. The Confusing‌ Instance Principle for Online‌‌ Linear Quadratic Control.Reinforcement Learning Journal6‌August 2025, 811–828‌HAL back to text‌‌
20 articleE. C.Eduardo Charles Vasconcellos,‌ Á. P.Álvaro Pinto‌ Fernandes Negreiros, A.‌‌ P.André Paulo Dantas de Araújo, R.‌Raphael Guerra, P.‌Philippe Preux, D.‌‌ H.Davi Henrique dos Santos, L. M.‌Luiz Marcos Garcia Gonçalves‌ and E. W.Esteban‌‌ Walter Gonzalez Clua. Yara: An Ocean Virtual‌ Environment for Research and‌ Development of Autonomous Sailing‌‌ Robots and Other Unmanned Surface Vessels.Journal‌ of Intelligent and Robotic‌ Systems1113July‌‌ 2025, 78HALDOI back to text‌
21 articleS.Sumit‌ Vashishtha and O.-A.Odalric-Ambrym‌‌ Maillard. Leveraging priors on distribution functions for‌ multi-arm bandits.Reinforcement‌ Learning Journal62025‌‌, 1600-1623In press. HAL DOI back to‌ text

International peer-reviewed conferences‌

22 inproceedingsA.Ayoub‌‌ Ajarra, B.Bishwamittra Ghosh and D.Debabrota‌ Basu. Active Fourier‌ Auditor for Estimating Distributional‌‌ Properties of ML Models.AAAI Conference on‌ Artificial IntelligencePhiladelphia, United‌ StatesFebruary 2025HAL‌‌back to text
23 inproceedingsN. A.Naheed‌ Anjum Arafat, D.‌Debabrota Basu, Y.‌‌Yulia Gel and Y.Yuzhou Chen. When‌ Witnesses Defend: A Witness‌ Graph Topological Layer for‌‌ Adversarial Graph Learning.‌AAAI Conference on Artificial IntelligencePhiladelphia, United States‌February 2025HAL back to text
24 inproceedings‌A.Achraf Azize and D.Debabrota Basu.‌ Some Targets Are Harder to Identify than Others:‌ Quantifying the Target-dependent Membership Leakage.AISTATS 2025‌ – International Conference on Artificial Intelligence and Statistics‌Phuket, ThailandMay 2025HAL back to text‌
25 inproceedingsA.Achraf Azize, Y.Yulian‌ Wu, J.Junya Honda, F.Francesco‌ Orabona, S.Shinji Ito and D.Debabrota‌ Basu. Optimal Regret of Bandits under Differential‌ Privacy.NeurIPS 2025 - 39th Annual Conference‌ on Neural Information Processing SystemsSan Diego (USA),‌ United States2025HALback to text
26‌ inproceedingsM.Mickaël Basson and P.Philippe Preux‌. Improving Diffusion Models for the Traveling Salesman‌ Problem (TSP) by Leveraging the Structure of the‌ Solution Space.Lecture Notes in Computer Science‌11th Annual Conference on machine Learning, Optimization and‌ Data science (LOD 2025)LNCSRiva del Sole,‌ Toscane, Italy2025HALback to text
27‌ inproceedingsE.Emilien Biré, A.Anthony Kobanda‌, L.Ludovic Denoyer and R.Rémy Portelas‌. Efficient Active Imitation Learning with Random Network‌ Distillation.ICLRSingapore, SingaporeApril 2025HAL‌back to text
28 inproceedingsS.Sunrit Chakraborty‌, S.Saptarshi Roy and D.Debabrota Basu‌. FLIPHAT: Joint Differential Privacy for High Dimensional‌ Sparse Linear Bandits.AISTATS 2025 – International‌ Conference on Artificial Intelligence and StatisticsPhuket, Thailand‌May 2025HAL back to text
29 inproceedings‌U.Udvas Das and D.Debabrota Basu.‌ Learning to Explore with Lagrangians for Bandits under‌ Unknown Linear Constraints.Twenty-Ninth Annual Conference on‌ Artificial Intelligence and Statistics (AISTATS)Tangier, MoroccoMay‌ 2025HAL
30 inproceedingsU.Udvas Das,‌ A.Apurv Shukla and D.Debabrota Basu.‌ FraPPE: Fast and Efficient Preference-based Pure Exploration.‌NeurIPS 2025 - 39th Annual Conference on Neural‌ Information Processing SystemsSan Diego (USA), United States‌December 2025HAL back to text
31 inproceedings‌R.Riccardo Della Vecchia and D.Debabrota Basu‌. Stochastic Online Instrumental Variable Regression: Regrets for‌ Endogeneity and Bandit Feedback.AAAI Conference on‌ Artificial IntelligencePhiladelphia, United StatesFebruary 2025HAL‌back to text
32 inproceedingsH.Hector Kohler‌, R.Riad Akrour and P.Philippe Preux‌. Breiman meets Bellman: Non-Greedy Decision Trees with‌ MDPs.Proceedings of the 31st ACM SIGKDD‌ Conference on Knowledge Discovery and Data MiningKDD‌ 2025 - The 31st ACM SIGKDD Conference on‌ Knowledge Discovery and Data Mining2Toronto, Canada‌ACM2025, 1207-1218HAL DOI back to‌ text
33 inproceedingsC.Cyrille Kone, M.‌Marc Jourdan and E.Emilie Kaufmann. Pareto‌ Set Identification With Posterior Sampling.Proceedings of‌ Machine Learning ResearchAISTATS 2025 - 28th International‌ Conference on Artificial In- telligence and StatisticPhuket,‌ ThailandMay 2025HALback to text
34‌ inproceedingsC.Cyrille Kone, E.Emilie Kaufmann and L.Laura Richert‌. Bandit Pareto Set‌ Identification in a Multi-Output‌‌ Linear Model.Proceedings of Machine Learning Research‌AISTATS 2025 - 28th‌ International Conference on Artificial‌‌ Intelligence and StatisticsPhuket, ThailandMay 2025HAL‌back to text
35‌ inproceedingsC.Cyrille Kone‌‌, E.Emilie Kaufmann and L.Laura Richert‌. Constrained Pareto Set‌ Identification with Bandit Feedback‌‌.Proceedings of Machine Learning ResearchICML 2025‌ - 42nd International Conference‌ on Machine LearningVancouver,‌‌ CanadaJuly 2025HALback to text
36‌ inproceedingsT.Thomas Michel‌, D.Debabrota Basu‌‌ and E.Emilie Kaufmann. DP-SPRT: Differentially Private‌ Sequential Probability Ratio Tests‌.Twenty-Ninth Annual Conference‌‌ on Artificial Intelligence and Statistics (AISTATS)Tangier, Morocco‌May 2026HAL DOI‌back to text
37‌‌ inproceedingsT.Thomas Michel, M.Marko Cvjetko‌, G.Gautier Hamon‌, P.-Y.Pierre-Yves Oudeyer‌‌ and C.Clément Moulin-Frier. Exploring Flow-Lenia Universes‌ with a Curiosity-driven AI‌ Scientist: Discovering Diverse Ecosystem‌‌ Dynamics.Artificial Life Conference Proceedings 37ALIFE‌ 2025 - Conference on‌ Artificial Life20251‌‌Kyoto / Virtual, Japan2025, 68HAL‌DOI back to text‌
38 inproceedingsA.Arpan‌‌ Mukherjee, M.Marcello Bullo, D.Debabrota‌ Basu and D.Deniz‌ Gündüz. Test-time Verification‌‌ via Optimal Transport: Coverage, ROC, & Sub-optimality.‌The Fourteenth International Conference‌ on Learning Representations (ICLR)‌‌Rio de Janeiro (BRAZIL), BrazilOctober 2025HAL‌
39 inproceedingsR.Riccardo‌ Poiani, M.Marc‌‌ Jourdan, E.Emilie Kaufmann and R.Rémy‌ Degenne. Best-Arm Identification‌ in Unimodal Bandits.‌‌Proceedings of Machine Learning ResearchAISTATS 2025 -‌ 28th International Conference on‌ Artificial Intelligence and Statistics‌‌Phuket, ThailandMay 2025HAL back to text‌
40 inproceedingsA.Adrienne‌ Tuynman and R.Rémy‌‌ Degenne. The Batch Complexity of Bandit Pure‌ Exploration.PMLRICML‌ 2025 - 42nd International‌‌ Conference on Machine Learning267Vancouver, Canada2025‌, 60442--60468HAL back‌ to text

Conferences without‌‌ proceedings

41 inproceedingsB.Brahim Driss, A.‌Alex Davey and R.‌Riad Akrour. PB²:‌‌ Preference Space Exploration via Population-Based Methods in Preference-Based‌ Reinforcement Learning.EWRL‌ — Eighteenth European Workshop‌‌ on Reinforcement LearningTubingen, GermanySeptember 2025HAL‌back to text
42‌ inproceedingsA.Anthony Kobanda‌‌, O.-A.Odalric-Ambrym Maillard and R.Rémy Portelas‌. A Continual Offline‌ Reinforcement Learning Benchmark for‌‌ Navigation Tasks.IEEE Conference on GamesLisboa,‌ PortugalJune 2025HAL‌back to text
43‌‌ inproceedingsA.Anthony Kobanda, R.Rémy Portelas‌, O.-A.Odalric-Ambrym Maillard‌ and L.Ludovic Denoyer‌‌. Hierarchical Subspaces of Policies for Continual Offline‌ Reinforcement Learning.MCDC‌ Workshop at ICLR 2025‌‌Singapore, SingaporeApril 2025HAL back to text‌
44 inproceedingsA.Alena‌ Shilova, A.Alex‌‌ Davey, B.Brahim Driss and R.Riad‌ Akrour. StaQ it!‌ Growing neural networks for‌‌ Policy Mirror Descent.EWRL — Eighteenth European‌ Workshop on Reinforcement Learning‌Tubingen, Germany2025HAL‌‌DOI back to text‌
45 inproceedingsA.Apurv Shukla and D.Debabrota‌ Basu. Unifying (Federated) (Private) High-Dimensional Bandits via‌ ADMM.EWRL -- Eighteenth European Workshop on‌ Reinforcement LearningTuebingen, Germany, GermanySeptember 2025HAL‌back to text

Reports & preprints

46 misc‌D.Debabrota Basu, S.Sourav Chakraborty,‌ D.Debarshi Chanda, B. D.Buddha Dev‌ Das, A.Arijit Ghosh and A.Arnab‌ Ray. Dimension Agnostic Testing of Survey Data‌ Credibility through the Lens of Regression.October‌ 2025HAL back to text
47 miscD.‌Debabrota Basu, U.Udvas Das, B.‌Brahim Driss and U.Uddalak Mukherjee. Performative‌ Policy Gradient: Optimality in Performative Reinforcement Learning.‌December 2025HAL
48 miscV.Victor Boone‌ and A.Adrienne Tuynman. Towards Blackwell Optimality:‌ Bellman Optimality Is All You Can Get.‌October 2025HAL back to text
49 misc‌R.Rémy Degenne, D.David Ledvinka,‌ E.Etienne Marion and P.Peter Pfaffelhuber.‌ Formalization of Brownian motion in Lean.November‌ 2025HAL back to text
50 miscR.‌Rémy Degenne. Markov kernels in Mathlib's probability‌ library.October 2025HAL back to text‌
51 miscF.Frédérick Fabre Ferber, D.‌Dominique Gay, J.-C.Jean-Christophe Soulié, J.‌Jean Diatta and O.-A.Odalric-Ambrym Maillard. Kriging‌ and Gaussian Process Interpolation for Georeferenced Data Augmentation‌.January 2025HALback to text
52‌ miscG.Guillaume Pourcel, D.Debabrota Basu‌, M.Maxence Ernoult and A.Aditya Gilra‌. Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary‌ conditions & equivalence with Hamiltonian Echo Learning.‌June 2025HAL back to text
53 misc‌ W.Waris Radji and O.-A.Odalric-Ambrym Maillard.‌ How Hard is it to Confuse a World‌ Model? October 2025 HAL back to text

11.3‌ Cited publications

54 bookM.M. Puterman.‌ Markov Decision Processes: Discrete Stochastic Dynamic Programming.‌John Wiley & Sons1994back to text‌
55 unpublishedB.B. Recht. A Tour‌ of Reinforcement Learning: The View from Continuous Control‌.2018, arxiv preprint 1806.09460back to‌ text
56 bookR.R.S. Sutton and A.‌A. Barto. Reinforcement Learning: an Introduction.‌http://incompleteideas.net/book/the-book-2nd.htmlMIT Press2018back to text
57‌ bookC.C. Szepesvári and T.T. Lattimore‌. Bandit Algorithms.Cambridge University press2019‌back to text

SCOOL - 2025

SCOOL - 2025

2025Activity report​​​‌Project-TeamSCOOL

Keywords

Computer Science﻿​﻿﻿ and Digital Science

Other Research Topics​​​‌ and Application Domains

1 Team members, visitors,﻿​﻿﻿ external collaborators

Research Scientists​‌﻿﻿

Faculty Members

Post-Doctoral Fellows

PhD Students﻿​﻿﻿

Technical​‌﻿﻿ Staff

Interns and​‌﻿﻿ Apprentices

Administrative Assistant

2 Overall objectives

3 Research program​​​‌

4​‌﻿﻿ Application domains

5 Social and environmental﻿﻿﻿‌ responsibility

6 Highlights of﻿​​﻿ the year

7 New results﻿﻿﻿‌

7.1 Bandits﻿​​﻿ and RL theory

7.2 Bandits and RL﻿‌​‌ under Real-life constraints

7.3 Bandits and​​​‌ RL for real-life: Deep﻿​﻿﻿ RL and Applications

7.4﻿﻿﻿‌ Others

7.4.1 Responsible﻿‌​‌ AI and Algorithmic Auditing﻿​​﻿

7.4.2 Formalization of​‌﻿﻿ mathematics

8 Bilateral contracts and​​​‌ grants with industry

8.1﻿﻿﻿‌ Bilateral contracts with industry﻿‌​‌

9 Partnerships﻿‌​‌ and cooperations

9.1﻿‌​‌ International initiatives

9.1.1 Inria﻿​​﻿ associate team not involved​​​‌ in an IIL or﻿﻿﻿‌ an international program

SeRAI﻿‌​‌

9.2 International​​﻿﻿ research visitors

9.2.1 Visits​​​‌ of international scientists

Other﻿​﻿﻿ international visits to the​‌﻿﻿ team

Ahana Deb

Pratik Karmakar​‌﻿﻿

9.2.2 Visits﻿​﻿﻿ to international teams

Research​‌﻿﻿ stays abroad

Hector Kohler​​﻿﻿

Cyrille​​﻿﻿ Koné

9.3 National initiatives

9.3.1​‌﻿﻿ ANR projects

9.3.2 PEPR projects

9.3.3 Other​‌﻿﻿ projects in France

9.3.4 Inria Exploratory​​​‌ Actions

10 Dissemination​​﻿﻿

10.1​​﻿﻿ Promoting scientific activities

10.1.1​​​‌ Scientific events: organisation

10.1.2 Scientific events:﻿​﻿﻿ selection

Member of the​‌﻿﻿ conference program committees

Reviewer​​​‌

10.1.3 Journal

Reviewer﻿​​﻿ - reviewing activities

10.1.4﻿﻿﻿‌ Invited talks

10.1.5 Scientific﻿‌​‌ expertise

10.1.6 Research administration

10.2 Teaching﻿​﻿﻿ - Supervision - Juries​‌﻿﻿ - Educational and pedagogical​​﻿﻿ outreach

10.2.1​​﻿﻿ Supervision

10.2.2 Juries

10.2.3﻿‌​‌ Educational and pedagogical outreach﻿​​﻿

10.3 Popularization﻿‌​‌

10.3.1 Productions (articles, videos,﻿​​﻿ podcasts, serious games, ...)​​​‌

10.3.2 Participation in Live﻿‌​‌ events

10.3.3 Others science﻿﻿﻿‌ outreach relevant activities

11 Scientific production﻿﻿﻿‌

11.1 Major publications

11.2 Publications﻿‌​‌ of the year

International﻿​​﻿ journals

International peer-reviewed conferences﻿﻿﻿‌

Conferences without﻿‌​‌ proceedings

Reports﻿​﻿﻿ & preprints

11.3​‌﻿﻿ Cited publications

2025Activity report‌Project-TeamSCOOL

Computer Science and Digital Science

Other Research Topics‌ and Application Domains

1 Team members, visitors, external collaborators

Research Scientists‌

PhD Students

Technical‌ Staff

Interns and‌ Apprentices

3 Research program‌

4‌ Application domains

5 Social and environmental‌ responsibility

6 Highlights of the year

7 New results‌

7.1 Bandits and RL theory

7.2 Bandits and RL‌‌ under Real-life constraints

7.3 Bandits and‌ RL for real-life: Deep RL and Applications

7.4‌ Others

7.4.1 Responsible‌‌ AI and Algorithmic Auditing

7.4.2 Formalization of‌ mathematics

8 Bilateral contracts and‌ grants with industry

8.1‌ Bilateral contracts with industry‌‌

9 Partnerships‌‌ and cooperations

9.1‌‌ International initiatives

9.1.1 Inria associate team not involved‌ in an IIL or‌ an international program

SeRAI‌‌

9.2 International research visitors

9.2.1 Visits‌ of international scientists

Other international visits to the‌ team

Pratik Karmakar‌

9.2.2 Visits to international teams

Research‌ stays abroad

Hector Kohler

Cyrille Koné

9.3.1‌ ANR projects

9.3.3 Other‌ projects in France

9.3.4 Inria Exploratory‌ Actions

10 Dissemination

10.1 Promoting scientific activities

10.1.1‌ Scientific events: organisation

10.1.2 Scientific events: selection

Member of the‌ conference program committees

Reviewer‌

Reviewer - reviewing activities

10.1.4‌ Invited talks

10.1.5 Scientific‌‌ expertise

10.2 Teaching - Supervision - Juries‌ - Educational and pedagogical outreach

10.2.1 Supervision

10.2.3‌‌ Educational and pedagogical outreach

10.3 Popularization‌‌

10.3.1 Productions (articles, videos, podcasts, serious games, ...)‌

10.3.2 Participation in Live‌‌ events

10.3.3 Others science‌ outreach relevant activities

11 Scientific production‌

11.2 Publications‌‌ of the year

International journals

International peer-reviewed conferences‌

Conferences without‌‌ proceedings

Reports & preprints

11.3‌ Cited publications