## Section: New Results

### Estimation and control for stochastic processes

#### Piecewise-deterministic Markov processes

Participants: Romain Azaïs, Florian Bouguet, Anne Gégout-Petit, Florine Greciet, Aurélie Muller-Gueudin

External participants: Michel Benaïm (Université de Neuchâtel), Bertrand Cloez (Inra-SupAgro MISTEA), Alexandre Genadot (Inria CQFD, Université de Bordeaux)

A piecewise-deterministic Markov process is a stochastic process whose behavior is governed by an ordinary differential equation between jumps occurring at random times. This class of stochastic processes offers a wide range of applications, especially in biology (kinetic dietary exposure models and the growth of bacteria, for example). BIGS members mainly work on statistical inference techniques for these stochastic processes [2], [29], an essential step in building relevant application models. We also investigate the probabilistic properties of these processes [32], [31], as well as an application in reliability to crack growth in alloys, in the industrial context of the PhD thesis of Florine Greciet with SAFRAN Aircraft Engines [33].

In a paper recently accepted for publication in Electronic Journal of Statistics [2], we focus on nonparametric estimation of the jump rate for piecewise-deterministic Markov processes observed over a long time interval, under an ergodicity condition. More precisely, we introduce an uncountable class (indexed by the deterministic flow) of recursive kernel estimates of the jump rate, and we establish their strong pointwise consistency as well as their asymptotic normality. In addition, we propose to choose from this class the estimator with minimal variance, which is unknown and must itself be estimated. We also discuss the choice of the bandwidth parameters by cross-validation methods. In [29], we state a new characterization of the jump rate when the transition kernel only charges a discrete subset of the state space. From this result we deduce a competitive nonparametric technique for estimating this feature of interest, and we state the uniform convergence in probability of the estimator. Both methodologies are illustrated on numerical examples and real data.
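To fix ideas, here is a minimal numerical sketch of occupation-based jump-rate estimation on a toy PDMP (linear flow $dx/dt = 1$, state-dependent jump rate $\lambda(x) = x$, state halved at each jump). The model, the Gaussian kernel, the bandwidth, and the time discretization are illustrative choices of ours; this is not the recursive estimator of [2].

```python
import math, random

def simulate(n_jumps, x0=1.0, seed=0):
    """Simulate a toy PDMP: dx/dt = 1 between jumps, jump rate lambda(x) = x,
    and the state is halved at each jump. Returns the flow segments (start
    state, duration) and the states observed just before each jump."""
    rng = random.Random(seed)
    x, segments, prejump = x0, [], []
    for _ in range(n_jumps):
        # time to next jump: solve x*t + t^2/2 = E with E ~ Exp(1) (inversion)
        e = rng.expovariate(1.0)
        t = -x + math.sqrt(x * x + 2.0 * e)
        segments.append((x, t))
        prejump.append(x + t)      # state just before the jump
        x = (x + t) / 2.0          # jump: the state is halved
    return segments, prejump

def jump_rate_estimate(segments, prejump, x, h=0.2, dt=0.02):
    """Kernel ratio: (smoothed count of jumps near x) / (smoothed time spent
    near x). The kernel normalization cancels between numerator and denominator."""
    k = lambda u: math.exp(-0.5 * (u / h) ** 2)
    num = sum(k(x - z) for z in prejump)
    den = 0.0
    for start, dur in segments:    # discretize the occupation time of each segment
        s = 0.0
        while s < dur:
            den += k(x - (start + s)) * dt
            s += dt
    return num / den

segments, prejump = simulate(3000)
# lambda(x) = x, so the estimate should roughly double between x = 0.75 and x = 1.5
lo = jump_rate_estimate(segments, prejump, 0.75)
hi = jump_rate_estimate(segments, prejump, 1.5)
```

The ratio `num / den` is the empirical analogue of "number of jumps per unit of time spent" in a neighborhood of `x`, which is exactly what a jump rate measures.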

The article [32] deals with a class of conservative growth-fragmentation equations from a probabilistic viewpoint. With the help of Foster-Lyapunov criteria, we study the long-time behavior of an associated piecewise-deterministic Markov process, which represents a typical individual following the dynamics of the equation. If growth and fragmentation are balanced, it is possible to establish existence and uniqueness of the stationary distribution of the process, as well as precise bounds on its distribution tails near both 0 and $+\infty $. Our probabilistic results are systematically compared to estimates already obtained with deterministic methods.

In [31], we are interested in the long-time behavior of time-inhomogeneous Markov chains. We put forward an original and unified approach relating some of their asymptotic properties (stationary distribution, speed of convergence, ...) to those of an auxiliary time-homogeneous Markov process. Such results are close to traditional functional limit theorems, but our method differs from the standard “tightness/identification” argument; it is based on the notion of asymptotic pseudotrajectories on the space of probability measures. We recover classical results, such as the convergence of normalized bandit algorithms to a piecewise-deterministic Markov process, and the approximation of weighted random walks and decreasing-step Euler schemes by solutions of stochastic differential equations.

#### Statistics of Markov chains

Participant: Romain Azaïs

External participants: Bernard Delyon (Université Rennes 1), François Portier (Télécom ParisTech)

Suppose that a mobile sensor describes a Markovian trajectory in the ambient space. At each time, the sensor measures an attribute of interest, e.g., the temperature. Using only the location history of the sensor and the associated measurements, the aim of the paper [27] is to estimate the average value of the attribute over the space. In contrast to classical probabilistic integration methods, e.g., Monte Carlo, the proposed approach does not require any knowledge of the distribution of the sensor trajectory. Probabilistic bounds on the convergence rates of the estimator are established. These rates are better than the traditional “root $n$” rate, where $n$ is the sample size, attached to other probabilistic integration methods. For finite sample sizes, the good behavior of the procedure is demonstrated through simulations, and an application to the evaluation of the average temperature of the oceans is considered.
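The general principle, though not the estimator or the rates of [27], can be sketched on a toy chain: weight each measurement by the inverse of a kernel estimate of the trajectory's occupation density, so that often-visited regions do not dominate the average. The chain, its target density, the measured attribute, and the bandwidth below are all invented for the example.

```python
import math, random

def run_sensor(n, seed=0):
    """Markov trajectory on the circle [0, 1): a Metropolis chain targeting the
    (assumed, for illustration) non-uniform density p(x) = 1 + 0.8*cos(2*pi*x)."""
    p = lambda x: 1.0 + 0.8 * math.cos(2.0 * math.pi * x)
    rng = random.Random(seed)
    x, path = 0.5, []
    for _ in range(n):
        y = (x + rng.uniform(-0.2, 0.2)) % 1.0
        if rng.random() < p(y) / p(x):
            x = y
        path.append(x)
    return path

def spatial_average(path, g, h=0.05):
    """Estimate the spatial average of g on [0, 1) by weighting each measurement
    with the inverse of a kernel estimate of the occupation density."""
    n = len(path)
    c = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    def density(x):
        s = 0.0
        for z in path:
            d = (x - z) % 1.0
            d = min(d, 1.0 - d)          # wrap-around distance on the circle
            s += math.exp(-0.5 * (d / h) ** 2)
        return c * s
    return sum(g(x) / density(x) for x in path) / n

path = run_sensor(1500)
g = lambda x: 1.0 + math.cos(2.0 * math.pi * x)   # attribute; true spatial mean is 1
naive = sum(g(x) for x in path) / len(path)       # biased toward often-visited regions
est = spatial_average(path, g)
```

Here the naive sample mean overshoots the true spatial average (the chain oversamples the region where both `p` and `g` peak), while the density-corrected estimate does not require knowing `p` at all, only the recorded locations.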

#### Real-time Tracking of the Photobleaching Trajectory during Photodynamic Therapy

Participant: T. Bastogne

Photodynamic therapy (PDT) is an alternative treatment for cancer that involves the administration of a photosensitizing agent, which is activated by light at a specific wavelength. This illumination causes, after a sequence of photoreactions, the production of reactive oxygen species responsible for the death of the tumor cells, but also the degradation of the photosensitizing agent, which then loses its fluorescence properties. This phenomenon, commonly known as photobleaching, can be considered a therapy efficiency indicator. In [8], we present the design and validation of a real-time controller able to track a preset photobleaching trajectory by modulating the light impulse width during treatment sessions. This innovative solution was validated by in vivo experiments, which showed a significant improvement in the reproducibility of the inter-individual photobleaching kinetics. We believe this approach could lead to personalized photodynamic therapy modalities in the near future.
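As a toy illustration of trajectory tracking by pulse-width modulation, the sketch below uses a deliberately crude one-parameter photobleaching model ($\dot{F} = -k\,u\,F$, with $u \in [0,1]$ the pulse-width fraction and $k$ an unknown individual sensitivity) and a simple certainty-equivalence controller. Both the model and the controller are our own stand-ins, not the model or controller of [8].

```python
import math

def track_photobleaching(k, r=0.3, horizon=10.0, dt=0.05, u0=0.5):
    """Toy closed loop: fluorescence F obeys F' = -k*u*F. The controller
    estimates k from the decay observed over one sampling period and then sets
    u = r / k_hat, so that F tracks the preset reference F_ref(t) = exp(-r*t)."""
    f, u, t = 1.0, u0, 0.0
    while t < horizon - 1e-9:
        f_next = f * math.exp(-k * u * dt)          # plant: one sampling period
        k_hat = -math.log(f_next / f) / (u * dt)    # observed decay -> estimate of k
        f, t = f_next, t + dt
        u = min(1.0, max(0.0, r / k_hat))           # certainty-equivalence update
    return f, u

# two "individuals" with different sensitivities track the same reference
f_a, u_a = track_photobleaching(k=0.6)
f_b, u_b = track_photobleaching(k=1.2)
ref = math.exp(-0.3 * 10.0)   # preset photobleaching trajectory at the horizon
```

The point of the sketch is inter-individual reproducibility: the more sensitive individual ends up receiving half the pulse width of the other, yet both fluorescence trajectories end close to the same preset value.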

#### Stochastic simulation and design of numerical experiments for the prediction of nanoparticle/X-ray interactions in radiotherapy

Participant: T. Bastogne

The growth of computational environments dedicated to the simulation of nanoparticle (NP)-X-ray interactions has opened new perspectives in the computer-aided design of nanostructured materials for biomedical applications. Several published studies have shown a crucial need for standardization of these numerical simulations [92]. This is why we proposed a multivariate robustness analysis in [8]. A gold nanoparticle (GNP) of 100 nm diameter was selected as a standard nano-system, activated by an X-ray source placed just below the NP. Two response variables were examined: the dose enhancement in seven different spatial regions of interest around the NP, and the duration of the experiments. Nine factors were pre-identified as potentially critical. A Plackett-Burman design of numerical experiments was applied to estimate and test the effect of each simulation factor on the examined responses. Four factors (the working volume, the spatial resolution, the spatial cutoff, and the computational mode, i.e., parallelization) do not significantly affect the dose deposition results, and only the last one may reduce the computational duration. The energy cutoff may cause significant variations of the dose enhancement in some specific regions of interest: the higher the cutoff, the closer to the GNP the secondary particles stop. By contrast, the Auger effect, the choice of the physical medium, and the fluence level clearly appear as critical simulation parameters. Consequently, these four factors must be examined before comparing and interpreting simulation results coming from different simulation sessions.
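The Plackett-Burman construction itself is easy to reproduce. The sketch below builds the classical 12-run design (which can screen up to 11 two-level factors, so it covers the nine factors mentioned above) and estimates main effects by column contrasts. The response function is a made-up stand-in, not the simulation outputs of [8].

```python
# First row of the classical 12-run Plackett-Burman design (Plackett & Burman, 1946)
GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

def pb12():
    """Build the 12-run design: 11 cyclic shifts of the generator plus a row of -1.
    Columns are balanced (six +1, six -1) and mutually orthogonal."""
    rows = [[GENERATOR[(j - i) % 11] for j in range(11)] for i in range(11)]
    rows.append([-1] * 11)
    return rows

def main_effects(design, response):
    """Effect of factor j = mean(response | factor at +1) - mean(response | -1)."""
    y = [response(row) for row in design]
    n = len(design)
    return [sum(row[j] * y[i] for i, row in enumerate(design)) / (n / 2)
            for j in range(len(design[0]))]

design = pb12()
# hypothetical screening response: factors 0 and 3 active, the others inert
effects = main_effects(design, lambda x: 5.0 + 2.0 * x[0] - 1.5 * x[3])
```

Because the columns are orthogonal, each main effect is estimated independently of the others: the two active factors stand out with effects of exactly twice their coefficients, while the inert factors show zero effect.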

In [9], we address the prediction of organometallic nanoparticle (NP)-based radiosensitization enhancement. The goal was to carry out computational experiments to quickly identify efficient nanostructures and then to preferentially select the most promising ones for the subsequent in vivo studies. To this aim, this interdisciplinary article introduces a new theoretical Monte Carlo computational ranking method and tests it on three organometallic NPs differing in size and composition. While the ranking predicted in a classical theoretical scenario did not fit the reference results at all, we showed for the first time how our accelerated in silico virtual screening method, based on basic in vitro experimental data (taking into account the NPs' cell biodistribution), was able to predict a relevant ranking in accordance with in vitro clonogenic efficiency. This corroborates the pertinence of such a prior ranking method, which could speed up the preclinical development of NPs in radiation therapy.

This in silico approach was tested in [25] to screen radiosensitizing nanoparticles, and the results were validated by in vitro assays.

#### Complexity analysis of Policy Iteration

Participant: Bruno Scherrer

Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study in [10] the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma $-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma $: the quantities of interest are bounds ${\tau}_{t}$ and ${\tau}_{r}$ (uniform over all states and policies), respectively on the *expected time spent in transient states* and *the inverse of the frequency of visits to recurrent states*, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde{O}\left({n}^{3}{m}^{2}{\tau}_{t}{\tau}_{r}\right)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye, in which ${\tau}_{t}\le 1$ and ${\tau}_{r}\le n$; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are respectively transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde{O}\left(m({n}^{2}{\tau}_{t}+n{\tau}_{r})\right)$ iterations.
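To make the two variants concrete, the following sketch implements Howard's PI and Simplex-PI on a small random MDP (a toy instance of our own, not an experiment from [10]). Both terminate with the same optimal value function; they differ only in how many states they switch per iteration.

```python
import random

def evaluate(policy, P, R, gamma, tol=1e-12):
    """Iterative policy evaluation: V(s) = R[s][a] + gamma * sum_t P[s][a][t] * V(t)."""
    n = len(policy)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]
            v = R[s][a] + gamma * sum(p * V[t] for t, p in enumerate(P[s][a]))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(P, R, gamma, variant):
    n = len(R)
    policy, iters = [0] * n, 0
    while True:
        V = evaluate(policy, P, R, gamma)
        # advantage of action a in state s with respect to the current value V
        A = [[R[s][a] + gamma * sum(p * V[t] for t, p in enumerate(P[s][a])) - V[s]
              for a in range(len(R[s]))] for s in range(n)]
        improvable = [s for s in range(n) if max(A[s]) > 1e-9]
        if not improvable:
            return policy, V, iters
        iters += 1
        if variant == "howard":   # switch every state with a positive advantage
            for s in improvable:
                policy[s] = max(range(len(A[s])), key=A[s].__getitem__)
        else:                     # simplex: only the state with maximal advantage
            s = max(improvable, key=lambda s: max(A[s]))
            policy[s] = max(range(len(A[s])), key=A[s].__getitem__)

def random_mdp(n, m, seed=0):
    rng = random.Random(seed)
    P = [[[rng.random() for _ in range(n)] for _ in range(m)] for _ in range(n)]
    for s in range(n):
        for a in range(m):
            z = sum(P[s][a])
            P[s][a] = [p / z for p in P[s][a]]
    R = [[rng.random() for _ in range(m)] for _ in range(n)]
    return P, R

P, R = random_mdp(6, 4)
pol_h, V_h, it_h = policy_iteration(P, R, 0.9, "howard")
pol_s, V_s, it_s = policy_iteration(P, R, 0.9, "simplex")
```

Each switch in either variant strictly improves the policy value, so no policy is ever revisited and both variants terminate in finitely many iterations; the bounds discussed above quantify how many.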

#### Approximate Dynamic Programming for Markov Games

Participant: Bruno Scherrer

We have made two contributions to the analysis of Approximate Dynamic Programming algorithms for Markov Games.

First, in [21] we extend several non-stationary Reinforcement Learning (RL) algorithms and their theoretical guarantees to the case of discounted zero-sum Markov Games (MGs). As in the case of Markov Decision Processes (MDPs), non-stationary algorithms are shown to exhibit better performance bounds than their stationary counterparts. The bounds obtained are generically composed of three terms: 1) a dependency on the discount factor $\gamma $, 2) a concentrability coefficient, and 3) a propagation error term. Depending on the algorithm, this error can be caused by a regression step, a policy evaluation step, or a best-response evaluation step. As a second contribution, we empirically demonstrate, on generic MGs (called Garnets), that non-stationary algorithms outperform their stationary counterparts. In addition, their performance is shown to depend mostly on the nature of the propagation error: algorithms whose error is due to the evaluation of a best response are penalized (even if they exhibit better concentrability coefficients and dependencies on $\gamma $) compared to those suffering from a regression error.

Furthermore, in [22] we report theoretical and empirical investigations on the use of quasi-Newton methods to minimize the Optimal Bellman Residual (OBR) of zero-sum two-player Markov Games. First, we show that state-of-the-art algorithms can be derived by direct application of Newton's method to different norms of the OBR. More precisely, when applied to the norm of the OBR, Newton's method results in the Bellman Residual Minimization Policy Iteration (BRMPI) algorithm and, when applied to the norm of the Projected OBR (POBR), it results in the standard Least Squares Policy Iteration (LSPI) algorithm. Consequently, we propose new algorithms that use quasi-Newton methods to minimize the OBR and the POBR, so as to benefit from enhanced empirical performance at low cost. Indeed, the quasi-Newton approach requires only slight modifications to the implementations of LSPI and BRMPI, but significantly improves both the stability and the performance of these algorithms. These phenomena are illustrated in an experiment conducted on artificially constructed games called Garnets.