CQFD is an INRIA Team joint with University of Bordeaux (UB1, UB2 and UB4) and CNRS (IMB, UMR 5251 and IMS, UMR 5218)
Economic, scientific and military competition leads many industrial sectors to design ever more successful and reliable processes and equipment.
Reliability and quality control, and more generally dependability and safety, have become a crucial area in the field of industrial engineering. The term reliability is an acquisition of the 20th century. Initially, reliability was developed to meet the needs of the electronics industry. This was a consequence of the fact that the first complex systems appeared in this field of engineering. Such systems have a huge number of components which made their global reliability very low in spite of their relatively highly reliable components. This led to a specialized applied mathematical discipline which allowed one to make an a priori evaluation of various reliability indexes at the design stage, to choose an optimal system structure, to improve methods of maintenance, and to estimate the reliability on the basis of special testing or exploitation.
Our objective is to apply probabilistic and statistical tools from estimation and control theory to dependability and safety. We wish to investigate the following fields:
Design and analysis of realistic and accurate random models for dependability. In particular, we will study parametric models for dynamic reliability and semi- or non-parametric models for quality control.
Implementation of estimation algorithms in relation with our stochastic models and evaluation of reliability indexes.
Design of control for maintenance and reconfiguration.
We stress the fact that points 1)and 2)are strongly interlinked. Indeed, designing mathematical models for reliability is an important and basic research field. However, our models will be legitimated and practically validated by point 2). In particular, the feasibility of estimation routines and the quality of the evaluation of reliability indexes will be crucial. Point 3)deals with control through practical issues of maintenance and reconfiguration. This last point legitimates and is based on the first two ones: only after modelling, identifying and evaluating reliability indexes, shall we be able to compute a cost criterion to be optimized.
The team CQFD is very involved in the organization of the "41e Journées de Statistique"Annual event of the "Société Française de Statistique" which will take place in Bordeaux in 2009. Reliability and Quality Control is one of the main topics of the meeting.
Since september 2007, P. Del Moral is developing a new research project on evolutionary type stochastic algorithms. This emerging project is oriented towards concrete applications with important potential industrial transfers on two central problems in advanced stochastic engineering; namely, Bayesian inference and rare event simulation, and more particularly on the following subjects : unsupervised learning, multi-target tracking, data assimilation, epidemic and micro-biology predictions.
The researchers involved in this emerging INRIA research team project, named Advanced Learning Evolutionary Algorithms (abbreviate ALEA)are :
F. Caron (CR INRIA) and M. Pace (PhD student, INRIA), J.F. Marckert (Labri), J. Garnier (Paris 7), and A. Doucet (UBC, Vancouver).
Piecewise Deterministic Markov Process
In dependability and safety theory, modeling is a key step to study the properties of the physical processes involved. Nowadays, it appears necessary to take explicitly and realistically the dependencies into account, meaning the dynamic interactions existing between the physical parameters (for example: pressure, temperature, flow rate, level, ...) of the system and the functional and dysfunctional behavior of its components. Classically, the models described in the dependability and safety literature do not take such interactions into account. A first set of methods used in reliability theory is the so-called combinatory approaches (fault trees, event trees, reliability diagrams and networks, ...), which can be used to identify and evaluate the combinations of events leading to the occurrence of other desirable or undesirable events. These powerful methods suffer from the fact that such combinations do not take the order of occurrence into account, in the sense that they eliminate any notion of dependency between events. A second set of methods is described by finite state Markov (or semi-Markov) models. In this context, the system is described by a fixed number of components which can be in different states. For any component, the set of its possible states is assumed to be finite (generally it contains only two elements: an operational and a failure state). One of the main limitations encountered with such models is their difficulties to model correctly the physical processes involving deterministic behavior. To overcome such difficulties, dynamic reliability was introduced in 1980 as a powerful mathematical framework capable of explicitly handling interactions between components and process variables. Nowadays in the literature, the multi-model approach appears as a natural framework to formulate dynamic reliability problems. The behavior of the physical model is thus described by different modes of operation from nominal to failure states with intermediate dysfunctional regimes. For a large class of industrial processes, the layout of operational or accident sequences generally comes from the occurrence of two types of events:
The first type of event is directly linked to a deterministic evolution of the physical parameters of the process.
The second type of event is purely stochastic and usually corresponds to random demands or failures of system components.
In both cases, these events will induce jumps in the behavior of the system leading to stable or unstable trajectories for the process.
In 1980, M.H.A. Davis
introduced in probability theory the Piecewise Deterministic Markov Processes (PDMP) as a general class of models
suitable for formulating optimization problems in queuing and inventory systems, maintenance-replacement models, investment scheduling and many other areas of operation research. These models
are described by two variables: to the usual Euclidean variable representative of the state of the process, one adds a discrete variable, called regime or mode and takes values in a finite or
countable set. In this context, the state variable represents the physical parameters of the system under consideration. For example, it can be the position or the orientation of a satellite
or the pressure in a tank. The mode characterizes the regimes of operation of the physical system from nominal to failure regime. From a mathematical point of view, the notion of a piecewise
deterministic Markov process is very intuitive and simple to describe. The state space of this system is given, for example, by an open subset
Eof the set
. Starting from
xE, the process follows a deterministic trajectory, namely a flow indexed by the mode, until the first jump time
T1, which occurs either spontaneously in a random manner or when the trajectory hits the boundary of
E. Between two jumps, the mode is assumed to be constant. In both cases, a new point and a new regime are selected by a random operator and the process restarts from this new point
under this new mode. There exist two types of jump :
The first one is deterministic. From the mathematical point of view, it is given by the fact that the trajectory hits the boundary of E. From the physical point of view, it can be seen as a modification of the mode of operation when a physical parameter reaches a prescribed level (for example when the pressure of a tank reaches the critical value).
The second one is stochastic. It models the random nature of failures or inputs that modify the mode of operation of the system.
As it has been illustrated above, the key asset of this mathematical model is that it takes naturally into account the two kinds of events previously described. Several examples can be found in , , , and . Most stochastic processes presented in T. Aven and U. Jensen are special cases of piecewise deterministic Markov processes.
In conclusion, it is claimed that piecewise deterministic Markov processes provide a general framework to study dynamic reliability problems. Their dynamical properties allow explicit time dependencies, in contrast with piecewise constant jump Markov processes. Consequently, these processes are really suitable for modeling real phenomena of dynamic reliability.
The probabilistic background offers a very suitable framework to evaluate material quality from the dependability and safety point of view. One can classically characterize the performances of a system by several indicators : availability, reliability, maintainability, safety, etc. Evaluating all these indicators is crucial. It makes it possible to calculate a certain costin order to measure the performances of the system. Hence, the well-known topic in control, called robustness, is given emphasis. In this framework, it is necessary to define the concept of subsystem and sensitivity:
Which are the subsystems of greater impact on the cost?
What is the evolution of cost sensibility with respect to modifications of one or several components of the system ?
For instance, evaluating the production availability of a factory is a vital concern for the industrial world. This notion complements the more classical notions of instantaneous availability and asymptotic availability. Production availability is a probability measure of the regularity of production. Previously, its calculation was usually based on the naive hypothesis that the production level associated with each mode of regime (operational, damaged, partial breakdown, ...) was instantaneously reached as soon as the system entered that state. Consequently, modeling of production availability was done via a discrete random variable and a typical trajectory of the production availability was piecewise constant. It was shown in on a large set of real cases that this hypothesis was not realistic. In fact, the production level evolves continuously and is influenced by the mode of regime as well as internal variables of the system such as pressure, temperature, etc.
Quite obviously, it is necessary to take into account the naturally continuous dynamic of indicators in Reliability, Availability and Maintainability (RAM). In particular, we shall see that the so-called Piecewise Deterministic Markov Processes are very suitable tools to define and evaluate the indicators in RAM for physical systems.
We are also interested in evaluating the occurrence of rare and critical events. In this context, random tree based algorithms have recently been applied with success in generating the excursion distributions of Markov processes evolving in critical and rare event regimes. For a rather detailed discussion on these advanced particle techniques with their applications in stochastic engineering we refer to the research monograph and to the more recent article dedicated to rare event simulation . In the path integration formalism, the distributions of a process evolving in a rare event can always be represented as a Feynman-Kac measures on trajectory spaces or on excursions between regions. In this interpretation, we stress that the occupation measures of genealogical trees associated with the corresponding genetic type particle algorithms give a precise statistical description of the strategy employed by the random process during these rare events. These descriptions allows thus to analyze the chain of elementary events leading the process to enter in such critical regions.
In the domain of safety and dependability, the notions of control and maintainability play a prominent role in the design of reliable systems. This maintainability can be activeor passive.
In this context of vulnerability, the usual way to make the system more tolerant towards failures is to introduce several redundancies. We improve the reliability of a system not only by improving the reliability of its components but also via their redundant organization. This is commonly called passive maintainability. However, it is not always possible to introduce direct physical redundancy which clearly restricts the usefulness of this approach. For example, it seems impossible to put motor units or pressure transducers in the same place on certain structures such as oil wells or communication satellites.
A second approach, more realistic and promising, is active maintainability. It is organized in the two following steps:
Detection and identification of failures,
Reconfiguration of the system.
In this context, PDMP are especially well-adapted for modeling real physical systems. In fact, a natural approach is first to make out a list of possible failures or breakdowns. This will lead to the constitution of a set of regimes or modes for the system described in section . Then, after the detection and identification of all those regimes, the control or maintainability process will be in a position to react and maintain the system in a damaged but acceptable mode. However, this modeling approach is subject to a number of limitations in terms of efficiency/complexity. More precisely, if a non-identified breakdown occurs, this approach can fail dramatically. A simple way to rectify this situation out is to include this kind of failure into the list of possible regimes. However, this will increase the complexity of the model. Therefore, a compromise must be sought during the modeling phase.
A classical aim of reliability is to study censored survival data. In this context, several parametric, semi-parametric, and non-parametric forms of modeling for survival functions estimation have already been proposed. In this project, we focus our attention on another aspect of reliability: Statistical Quality Control (SQC). More precisely, we wish to develop non-parametric and semi-parametric modeling in order to provide tolerance curves and hyper surfaces.
Tolerance curves are used in industry to predict performance of the manufacturing process from external measures such as temperature or pressure. They are particularly useful when the quality control is late (long manufacture time, intermediate storage ...) or results from small size samples. Tolerance curves provide the inspectors with a tool to control whether the evaluated parameters are within the interval required in the specifications and to make the inspection organization more efficient. Because of their graphical representation, they are particularly easy to use.
A tolerance interval differs from the well-known confidence interval. A confidence interval gives information about the position of the mean value of the parameter
whereas a tolerance interval gives information about the position of the parameter and the probability for this parameter to be in this interval. Let
Ybe the random variable which represents this parameter and
Xthe covariate (temperature, pressure,...). To take these covariate into account in the evaluation of the tolerance interval of
Y, the conditional distribution of
Ygiven
Xis studied. When
Xis in
, the conditional quantiles of
Ygiven
Xare used to build tolerance curves and when
Xis in
, they are used to build tolerance hyper surfaces. Finally when
, several parameters are studied simultaneously and multivariate or spatial conditional quantiles are used to build a tolerance region.
Three types of modeling can be used to define conditional univariate or multivariate quantiles. Parametric modeling has the advantage of giving results easy to interpret but the parametric
shape of the conditional distribution of
Ygiven
Xhas to be predetermined. Non-parametric modeling is more flexible because it relaxes the constraint on the conditional distribution. However, in practice the results are difficult to
interpret. Semi-parametric modeling is therefore a compromise between these two types of modeling and gives results easy to interpret. Real indices
X'are indeed incorporated in order to reduce the dimension of the explicative part of the model and no parametric structure is imposed on the link between
Yand
X'.
The choice of parametric, non-parametric or semi-parametric modeling is thus a key point in the estimation of tolerance curves, hyper surfaces or regions.
Non-parametric conditional quantiles estimation is usually based on kernel or local polynomial estimation methods (see for instance
,
. ) and suffers, like local smoothing methods, from what is called the curse of dimensionality. Indeed, when the
dimension
pof the covariate
Xincreases, the dispersion of the data increases and the quality of the estimation decreases. Another drawback is that graphical representation is possible only when the dimension of
Xis equal to 1 or 2: tolerance curves are obtained in 2D when
and tolerance hyper surfaces are obtained in 3D when
.
To avoid these two drawbacks, semi-parametric modeling can be used. The following two-step (one parametric and one non-parametric) methodology for conditional quantiles estimation is
proposed. First of all, the Euclidean parameter
used to reduce
Xto
X'is estimated. Next, the functional parameters used in the non-parametric conditional quantile estimation are estimated. More precisely the semi-parametric approach combines the SIR
(sliced inverse regression) method and the kernel estimation of the conditional quantile. This methodology allows graphical representation of the tolerance curves in 2D (
) or surfaces in 3D (
), with the index
X'easy to interpret.
We want to focus our attention on dimension-reduction approach for problems of quality control (sliced inverse regression approaches, clustering of variables, ...). We still work on non-parametric and semi-parametric estimation of tolerance curves and hyper-surfaces, via estimation of (multivariate) conditional quantiles. Moreover, another point of interest is the introduction of recursive methods in the estimation process in order to be able to manage the data stream, and at present we develop recursive methods in semi-parametric model for the estimation of tolerance curves.
The following examples illustrate the importance of dependability and safety in various fields.
A first example concerns oil production in deep water. We have already worked with IFREMER on the reliability of oil rigs and with IFP (French Oil Institute) on risk assessment and control for extraction of hydrocarbon substances from submarine deposits hard to work due to their depth.
A second example in the military field concerns combat aircraft with "relaxed static stability". These aircraft are slightly aerodynamically unstable by design: they will quickly depart from level and controlled flight unless the pilot constantly works to keep it in trim. While this enhances maneuverability, it is very wearing on a pilot relying on a mechanical flight control system. Hence, the aircraft is highly vulnerable to sensors or on-board calculator breakdown.
A third example deals with quality control linked with biomedical and biometric studies led by CERIES (Centre de Recherches et Investigations Epidermiques Sensorielles): the research center of CHANEL on human skin. The knowledge of tolerance curves for numerous skin biophysical parameters is crucial for CERIES in so far as it enables CHANEL chemists to develop new cosmetic products better adapted to the aimed target: elderly women on the Asian market, young Caucasian or African-American women. Thus, knowing the skin features of a person is enough to decide whether or not they fit in the reference limits of the various CHANEL cosmetic products.
A last example concerns the development of air quality control strategies, which is a wide preoccupation for human health. In order to achieve this purpose, air pollution sources have to be accurately identified and quantified. We have already worked in 2007-2008 on a scientific project initiated by the French ministry of Ecology and Sustainable Development. We have also worked on the statistical and quality control parts of a study financed by VNF (Voies Navigables de France) concerning a satisfaction survey of sailors on the “canal des deux mers” in south of France. An other possible application concerns the use of tolerance curves in industrial quality control process. We already had discussions with Michelin on this subject.
Most of the statistical methods for dimension reduction and quality control have been implemented in R : variables clustering (Chavent,Kuentz), cluster-SIR (Kuentz, Saracco), bootstrap choice of parameters for SIR related methods (Liquet, Saracco), geometric multivariate (conditional) quantile (Chaouch, Saracco), recursive SIR (Bercu, Nguyen, Saracco), sample selection models (Chavent, Liquet, Saracco), bagging SIR (Kuentz, Liquet, Saracco).
The following results have been obtained in collaboration with Oswaldo Luis Do Valle Costa from Escola Politécnica da Universidade de São Paulo, Brazil. The main goal of this work published
in
is to establish some equivalence results on stability, recurrence and ergodicity between a piecewise deterministic
Markov process (PDMP for short)
{
X(
t)}and an embedded discrete-time Markov chain
{
n}generated by a Markov kernel
Gthat can be explicitly characterized in terms of the three local characteristics of the PDMP, leading to tractable criteria results. First we establish some important results
characterizing
{
n}as a sampling of the PDMP
{
X(
t)}and deriving a connection between the probability of the first return time to a set for the discrete-time Markov chains generated by
Gand the resolvent kernel
Rof the PDMP. From these results we obtain equivalence results regarding irreducibility, existence of
-finite invariant measures, (positive) recurrence and (positive) Harris recurrence between
{
X(
t)}and
{
n}, generalizing the results of
in several directions. Sufficient conditions in terms of a modified Foster-Lyapunov criterion are also presented to
ensure positive Harris recurrence and ergodicity of the PDMP. We illustrate the use of these conditions by showing the ergodicity of a capacity expansion model.
The long run average continuous control problem of piecewise deterministic Markov processes (PDMP's) taking values in a general Borel space and with compact action space depending on the state variable is investigated. The control variable acts on the jump rate and transition measure of the PDMP, and the running and boundary costs are assumed to be positive but not necessarily bounded. As far as we are aware of, this is the first time that this kind of problem is considered in the literature. Indeed, results are available for the long run average cost problem but for impulse control see Costa , Gatarek and the book by M.H.A. Davis (see the references therein). On the other hand, the continuous control problem has been studied only for discounted costs by A. Almudevar , M.H.A. Davis , , M.A.H. Dempster and J.J. Ye , , Forwick, Schäl, and Schmitz , M. Schäl , A.A. Yushkevich , .
We have recently derived the following results :
it has been obtained an optimality equation for the long run average cost in terms of a discrete-time optimality equation related to the embedded Markov chain given by the post-jump location of the PDMP.
the existence of a feedback measurable selector for the discrete-time optimality equation has been shown by establishing a connection between this equation and an integro-differential equation
sufficient conditions has been obtained for the existence of a solution for a discrete-time optimality inequality and an ordinary optimal feedback control for the long run average cost using the so-called vanishing discount approach (see , page 83).
The aim of the three works described in this section is to highlight the potentiality of a method which combines the high modelling ability of the piecewise deterministic Markov processes and the great computing power inherent to the Monte Carlo simulation.
The heated tank problem is a famous case-study of dynamic reliability (see
and references therein). It consists of a tank containing a fluid whose level is controlled by three components :
two inlet pumps (units 1 and 2) and one outlet valve (unit 3). Each component has four possible states : on, off, stuck on or stuck off. A thermal power source heats up the fluid, and the two
continuous variables of interest are the temperature
and liquid level
h. At time
t= 0, the system is in the equilibrium state where both
and
hare constant,
= 30.9261
oC and
h= 7m, units 1 and 3 are on and unit 2 is off. The temperature and level evolve according to a differential equation depending on the state of the three components.
The components change state either spontaneously according to an inhomogeneous Poisson jump process whose intensity depends on the temperature, or deterministically (providing units are not
stuck) when the level reaches a lower (6m) or upper (8m) limit : in the former case, units 1 and 2 are turned on and unit 3 is turned off, in the latter case, units 1 and 2 are turned off and
unit 3 is turned on. We are interested in three possible top events : dry out (
h4m), overflow (
h10m) or hot temperature
100
oC), and their respective cumulative probability at any time. The evolution of this hybrid system can be described by a piecewise deterministic Markov process (PDMP) with explicit
coefficients.
The aim of the work published in is to show the ability of the PDMP approach to solve dynamic reliability problems by applying it to the specific but not trivial example described above. Contrary to existing methods, we develop a simulation method which does not require any time-space discretization.
We also investigated another numerical method to compute the three different failure rates. M.H.A. Davis in proved that the expectation of a certain class of functionals of PDMP is the unique solution of a system of differential equations. The failure probabilities fall into this class. For the heated tank problem and the high temperature probability, we have derived a system of 17 differential equations. Each equation on its own is simple enough, but interesting problems arise with the coupling because the domains of each equation are different and with possibly complicated geometry and non-standard limit conditions. In collaboration with Thierry Colin of team-project MC2, we implemented a numerical code to compute the failure probability.
Finally, a more realistic example of an offshore oil production system has been analyzed in . The results obtained have been compared with those given by an ad hoc Petri net model.
We are developing a computational method for optimal stopping of a piecewise deterministic Markov process by using a quantization technique of Markov chains, see . Optimal stopping problems have been studied for PDMP's in , , , , , . In the author defines an operator related to the first jump time of the process and shows that the value function of the optimal stopping problem is a fixed point for this operator. Based on a probabilistic interpretation for this jump operator, we designed a numerical scheme to approximate the value function. The originality of our work is two-fold. On the one hand, instead of using a fixed time-discretization grid for our continuous-time process, we first compute the quantization of an underlying discrete-time Markov chain which naturally appears in our problem, and only then derive path-adapted time-discretization grids. On the other hand, some of the functions involved in this optimization problem are not regular despite some strong regularity assumptions on the parameters of the problem. Therefore, the approach developed in cannot be directly applied to our problem. However, due to the special structure of PDMP's we are able to overcome this difficulty.
Bifurcating autoregressive (BAR) processes are an adaptation of autoregressive (AR) processes to binary tree structured data. They were first introduced by Cowan and Staudte for cell lineage data, where each individual in one generation gives birth to two offspring in the next generation. Cell lineage data typically consist of observations of some quantitative characteristic of the cells over several generations of descendants from an initial cell. BAR processes take into account both inherited and environmental effects to explain the evolution of the quantitative characteristic under study.
There are several results on statistical inference and asymptotic properties of estimators for BAR models in the literature. For maximum likelihood inference on small independent trees, see Huggins and Basawa . For maximum likelihood inference on a single large tree, see Huggins for the original BAR model, Huggins and Basawa for higher order Gaussian BAR models, and Zhou and Basawa for exponential first-order BAR processes. We also refer the reader to Zhou and Basawa for the least squares (LS) parameter estimation. In all those papers, the BAR process is supposed to be stationary. In Guyon , the LS estimator is also investigated, but the process is not stationary, and the author makes intensive use of the tree structure and Markov chain theory.
We have carried out a sharper analysis of the asymptotic properties of the LS estimators of the unknown parameters of first-order BAR processes and improved the previous results of Guyon via a martingale approach, based on the generation-wise filtration. As previously done by Basawa and Zhou , , , we made use of the strong law of large numbers as well as the central limit theorem , for martingales. It allowed us to go further in the analysis of first-order BAR processes. Namely, we have established the almost sure convergence of our LS estimators with a sharp rate of convergence, together with the quadratic strong law and the central limit theorem.
This work has been submitted to be published , and was presented at the Joint Meeting of the SSC and the SFdS , at the Journées MAS 2008 , and at CISPA 2008 .
Since the pioneer work of Aström and Wittenmark
, a wide range of literature is available on parametric estimation and adaptive tracking for linear regression
models
,
,
,
,
,
. However, only few references may be found on nonparametric estimation in adaptive tracking
,
,
,
. Our goal was to investigate the asymptotic properties of a kernel density estimator associated with the driven
noise of a linear regression in adaptive tracking and to carry out a goodness-of-fit test. More precisely, consider the multivariate ARMAX model of order
(
p,
q,
r)given, for all
n0, by
A(
R)
Xn=
B(
R)
Un+
C(
R)
n
where
Xn,
Unand
nare the
d-dimensional system output, input and driven noise, respectively. Denote by
Rthe shift-back operator and set
where
Ai,
Bj, and
Ckare unknown matrices and
Idis the identity matrice of order
d. For the sake of simplicity, we shall assume that the high frequency gain matrix
B1is known with
B1=
Id. The most common way for estimating the unknown parameter
of the model is to make use of the extended least-squares estimator
. Moreover, the crucial role played by the control
Unis to regulate the dynamic of the process
(
X
n)by forcing
Xnto track step by step a bounded predictable reference trajectory
xn*. We make use of the standard adaptive tracking control
Unproposed by Aström and Wittenmark
. Furthermore, we shall assume that the driven noise
(
n)is a sequence of centered independent and identically distributed random vectors with positive definite covariance matrix
and unknown probability density function denoted by
f.
Our goal was to study the asymptotic properties of a recursive kernel density estimator of
fgiven, for all
n1, by
where the kernel
Kis a chosen density function and the bandwidth
(
h
n)is a sequence of positive real numbers decreasing to zero. Our purpose is first to show that
behaves pretty well as a recursive kernel density estimator of
fin adaptive tracking and secondly to carry out a goodness-of-fit test for
fbased on
. Such a goodness-of-fit test is very popular in time series, in particular for testing the normality hypothesis. However, to the best of our knowledge, no work is concerned with
asymptotic properties of recursive kernel density estimator in adaptive tracking.
We provide an almost sure pointwise and uniform strong law of large numbers for as well as a pointwise and multivariate central limit theorem. We also carry out a goodness-of-fit test together with some simulation experiments.
This work has been recently published to SIAM , and it will be presented at the 47 eIEEE Conference on Decision and Control, Cancun, Mexico, 2008.
We propose a new concept of strong controllability related to the Schur complement of a suitable limiting matrix. This new notion allows us to extend the previous convergence results associated with multidimensional ARX models in adaptive tracking. On the one hand, we carry out a sharp analysis of the almost sure convergence for both least squares and weighted least squares algorithms. On the other hand, we also provide a central limit theorem and a law of iterated logarithm for these two algorithms. Our asymptotic results are illustrated by numerical simulations.
This work has been recently submitted to Automatica , and it will be presented at the 47 eIEEE Conference on Decision and Control, Cancun, Mexico, 2008.
Concerning the dimension reduction framework which is a useful tool for the quality control part of the project, some results have been obtained for Sliced Inverse Regression (SIR) approach.
To reduce the dimensionality of regression problems, sliced inverse regression approaches make it possible to determine linear combinations of a set of explanatory variables
Xrelated to the response variable
Yin general semiparametric regression context. From a practical point of view, the determination of a suitable dimension (number of the linear combination of
X) is important. In the literature, statistical tests based on the nullity of some eigenvalues have been proposed. Another approach is to consider the quality of the estimation of the
effective dimension reduction (EDR) space. The square trace correlation between the true EDR space and its estimate can be used as goodness of estimation. In
, we focus on the SIR
method and propose a naïve bootstrap estimation of the square trace correlation criterion. Moreover, this criterion could also select the
parameter in the SIR
method. We indicate how it can be used in practice. A simulation study is performed to illustrate the behaviour of this approach.
In
, we consider a semiparametric regression model such that the dependent variable
Yis linked to some indices
X'kthrough an unknown link function. SIR-I, SIR-II and SIR
methods were introduced in order to estimate the EDR space spanned by the vectors
k. These methods are computationally fast and simple but are influenced by the choice of slices in the estimation process. In this paper, we suggest to use versions of SIR methods based
on fuzzy clusters instead of slices which can be seen as hard clusters and we exhibit the corresponding algorithm. We illustrate the sample behaviour of the fuzzy inverse regression estimators
and compare them with the SIR ones on simulation study.
In the theory of sufficient dimension reduction, SIR is a famous technique that enables to reduce the dimensionality of regression problems. This semiparametric regression method is based on a linearity condition on the marginal distribution of the predictor x, which appears to be a limitation. In , we propose to cluster the predictor space so that this condition approximately holds in the different partitions. We estimate the dimension reduction subspace by combining the individual estimates of the clusters. We give asymptotic properties of the corresponding estimator and show with a simulation study the numerical performances of cluster-based SIR.
In
we consider a multivariate semiparametric sample selection model and we develop a geometric approach to the
estimation of the slope vectors in the outcome equation and in the selection equation. Contrary to most existing methods, we deal symmetrically with both slope vectors. The estimation method is
link-free and distribution-free, it works in two main steps : a multivariate SIR step, and a Canonical Analysis step. We establish root
n-consistency and asymptotic normality of the estimates. We give results from a simulation study in order to illustrate the estimation method.
More recently, Bercu, Nguyen and Saracco propose a recursive estimator for the matrices of interest in SIR approach. Moreover, when the number
Hof slices is equal to two, we obtain a recursive estimator of the direction of the parameter
in the semiparamtric regression model
y=
f(
x',
)which does not require the estimation of the link function
f. We give asymptotic results for this estimator. A simulation study illustrates the good numerical behaviour of this estimator for moderate sample size even if the dimension of the
covariate
xis important.
In , we consider a regression model in which the variable of interest is censored. We present various nonparametric estimators of the conditional distribution function and of conditional quantiles. In a simulation study, we compare the performance of these estimators. Moreover the local linear estimator of conditional quantile is applied on a dataset dealing with the effect of age on survival time of kidney transplant patients.
In a multidimensional setting, lack of objective basis for ordering multivariate observations is a major problem in extending the notion of quantiles. Conditional quantiles are required in various biomedical or industrial problems. Numerous alternative definitions of (conditional) quantile for multidimensional variables, have been proposed in statistical literature. In we focus on the notion of geometric quantile and conditional geometric quantile, based on the minimization of a loss function. Asymptotics results has been obtained and an implementation in R allowed us to show the good numerical performances of the proposed estimators in practice on simulated and real datasets.
Dimension reduction via clustering of variables.Most of clustering methods have been developed for the clustering of units. Concerning variable clustering, few techniques are available. They can however be useful for many statistical purposes : dimension reduction, selection of variables, etc. In we propose a divisive hierarchical approach for the clustering of quantitative variables, which is based on the VARCLUS procedure (VARiables CLUStering) of SAS software. We extend this approach to qualitative variables and propose a technique to help to choose the number of clusters. Finally we present some results obtained with simulated quantitative data.
Clustering with geographical constraints.Agricultural policies have recently experienced major reformulations and became more and more spatialised. Agri-environmental indicators (AEIs) provide an essential tool for formalizing information from different sources and to address the impact of agricultural production on the environment. An important political issue is currently the implementation of WFD (Water Framework Directive) in European countries. A study is carrying out at Cemagref in the context of the SPICOSA projectand of the implementation of WFD ((Water Framework Directive): the purpose is to define the relevant spatial unit, helpful for the integrated management of the continuum “Pertuis Charentais Sea” and “Charente river basin”. We have to define homogeneous areas within the Charente basin to calculate the spatialised AEIs and to implement an hydrological model (SWAT). The goal is then to partition the hydrological units and to obtain some clusters as homogeneous as possible in order to implement AEIs and the SWAT model. To address this problem, we have implemented a new clustering method. DIVCLUS-T is a descendant hierarchical clustering algorithm based on a monothetic bipartitional approach allowing the dendrogram of the hierarchy to be read as a decision tree. We propose in a new version of this method called C-DIVCLUS-T which is able to take contiguity constraints into account. We apply C-DIVCLUS-T to hydrological areas described by agricultural and environmental variables, in order to take their geographical contiguity into account in the monothetic clustering process.
Uncertainty and variability are often met in practice in data concerned by quality control process. The uncertainty or the variability of the data may be treated by considering, rather than a single value for each data, the interval of values in which it may fall. In we study the derivation of basic description statistics for interval-valued datasets. We propose a geometrical approach in the determination of summary statistics (central tendency and dispersion measures) for interval-valued variables. This approach mimics the case of real-valued variables, with the absolute value of the difference between two real numbers being replaced by a distance between two intervals. We give, when the Hausdorff distance is used to compare two intervals, explicit definitions of the central intervals (and dispersion measures), which generalize to the mean value, the median and the midrange.
Air pollution is a wide concern for human health and requires the development of air quality control strategies. In order to achieve this goal pollution sources have to be accurately identified and quantified. The case study presented in this is part of a scientific project initiated by the French Ministry of Ecology and Sustainable Development. For the following study measurements of chemical composition data for particles have been conducted on a French urban site. The first step of the study consists in the identification of the sources profiles which is achieved through Principal Component Analysis completed by a rotation technique. Then the apportionment of the sources is evaluated with a receptor modeling using Positive Matrix Factorization as estimation method. Finally the joint use of these two statistical methods enables to characterize and apportion five different sources of fine particulate emission.
This new line of research has been started during the period 2007-2008. It is mainly concerned with the design and the analysis of a new class of interacting stochastic algorithms for sampling complex distributions including Boltzmann-Gibbs measures and Feynman-Kac path integral semigroups arising in physics, in biology and in advanced stochastic engineering science.
These interacting sampling methods can be described as adaptive and dynamic simulation algorithms which take advantage of the information carried by the past history to increase the quality of the next series of samples. One critical aspect of this technique as opposed to standard Markov chain Monte Carlo methods is that it provides a natural adaptation and reinforced learning strategy of the physical or engineering evolution equation at hand. This type of reinforcement with the past is observed frequently in nature and society, where beneficial interactions with the past history tend to be repeated. Moreover, in contrast to more traditional mean field type particle models and related sequential Monte Carlo techniques, these stochastic algorithms can increase the precision and performance of the numerical approximations iteratively.
The origins of these interacting sampling methods can be traced back to a pair of articles , by P. Del Moral and L. Miclo. These studies are concerned with biology-inspired self-interacting Markov chain models with applications to genetic type algorithms involving a competition between the natural reinforcement mechanisms and the potential attraction of a given exploration landscape.
In the period 2007-2008, these lines of research have been developed in three different directions :
The use of self-interacting Markov chain methods to Markov chain Monte Carlo methodology has been developed in a series of joint articles of C. Andrieu, P. Del Moral, A. Doucet and A. Jasra , , , , as well as in the more recent article of A. Brockwell, P. Del Moral and A. Doucet . Related ideas have also appeared in computational chemistry .
In a more recent article , we develop a new class of interacting Markov chain Monte Carlo models for solving numerically general discrete-time measure-valued equations. The associated stochastic processes belong to the class of self-interacting Markov chains. In contrast to traditional Markov chains, their time evolution may depend on the occupation measure of the past values. This general methodology allows us to provide a natural way to sample from a sequence of target probability measures of increasing complexity. We also design an original theoretical analysis of the behaviour of these algorithms as the time parameter tends to infinity. This analysis relies on measure-valued processes and semigroup techniques. We also present a variety of convergence results including exponential estimates and a uniform convergence theorem with respect to the number of target distributions, yielding what seems to be the first results of this kind for this class of self-interacting models.
Functional central limit theorems are developed in a pair of articles
,
. This pair of articles provide an original stochastic analysis based on semigroup techniques on distribution
spaces and fluctuation theorems for self-interaction random fields. Besides the fluctuation analysis of these models, we also present a series of sharp
Lm-mean error bounds in terms of the semigroup associated with the first order expansion of the limiting measure valued process, yielding what seems to be the first results of this
type for this class of interacting processes.
The design and the mathematical analysis of genetic type and branching particle interpretations of Feynman-Kac-Schroedinger type semigroups ( and vice versa) has been developed by group of researchers since the beginning of the 90's. In Bayesian statistics these sampling technology is also called sequential Monte Carlo methods. For further details, we refer to the books , , and references therein.
This Feynman-Kac particle methodology is increasingly identified with emerging subjects of physics, biology, and engineering science. This new theory on genetic type branching and interacting particle systems has led to spectacular results in signal processing , , , and in quantum chemistry with precise estimates of the top eigenvalues, and the ground states of Schrodinger operators , , . It offers a rigorous and unifying mathematical framework to analyze the convergence of a variety of heuristic-like algorithms currently used in biology, physics and engineering literature since the beginning of the 1950's. It applies to any stochastic engineering problem which can be translated in terms of functional Feynman-Kac type measures.
During the last two decades the range of application of this modern approach to Feynman-Kac models has increased revealing unexpected applications in a number of scientific disciplines including in :
The analysis of Dirichlet problems with boundary conditions, financial mathematics, molecular analysis, rare events and directed polymers simulation, genetic algorithms, Metropolis-Hastings type models, as well as filtering problems and hidden Markov chains.
In the period 2007-2008, these lines of research have been developed in four different directions :
In , we present a mean field particle theory for the numerical approximation of Feynman-Kac path integrals in the context of nonlinear filtering. We show that the conditional distribution of the signal paths given a series of noisy and partial observation data is approximated by the occupation measure of a genealogical tree model associated with mean field interacting particle model. The complete historical model converges to the McKean distribution of the paths of a nonlinear Markov chain dictated by the mean field interpretation model. We also review the stability properties and the asymptotic analysis of these interacting processes, including fluctuation theorems and large deviation principles. We also present an original Laurent type and algebraic tree-based integral representations of particle block distributions. These sharp and non asymptotic propagations of chaos properties seem to be the first result of this type for mean field and interacting particle systems. Strong propagations of chaos theorems for continuous time models are also presented in the article .
In , we analyze the convergence of a class of sequential Monte Carlo methods where the times at which resampling occurs are computed on-line using criteria such as the effective sample size. This is a popular approach amongst practitioners but there are very few convergence results available for these methods. It is shown here that these genetic type algorithms correspond to a particle approximation of a Feynman-Kac flow of measures on adaptive excursion spaces. By combining a non-linear distribution flow analysis to an original coupling technique, we obtain functional central limit theorems and uniform exponential concentration estimates for these algorithms. The original exponential concentration theorems presented in this study significantly improve previous concentration estimates obtained for sequential Monte Carlo algorithms.
In
, we present a non asymptotic theorem for interacting particle approximations of unnormalized Feynman-Kac
models. We provide an original stochastic analysis based on Feynman-Kac semigroup techniques combined with recently developed coalescent tree-based functional representations of particle
block distributions. We present some regularity conditions under which the
L2-relative error of these weighted particle measures grows linearly with respect to the time horizon yielding what seems to be the first results of this type for this class of
unnormalized models. We also illustrate these results in the context of particle simulation of static Boltzmann-Gibbs measures and restricted distributions, with a special interest in rare
event analysis.
In , we analyze the long time behavior of neutral genetic population models, with fixed population size. We design an explicit, finite, exact, genealogical tree based representation of stationary populations that holds both for finite and infinite types (or alleles) models. We then analyze the decays to the equilibrium of finite populations in terms of the convergence to stationarity of their first common ancestor. We estimate the Lyapunov exponent of the distribution flows with respect to the total variation norm. We give bounds on these exponents only depending on the stability with respect to mutation of a single individual; they are inversely proportional to the population size parameter.
We are working on the statistical and quality control parts (sampling and data collection, survey weights, sampling errors) of a study financed by VNF (Voies Navigables de France) concerning a satisfaction survey of sailors on the “canal des deux mers” in south of France (total value of the contract: 6000 euros).
The goal of this project is to propose and study an approach to evaluate the probability of occurrence of events defined by the crossing of a threshold.
In the framework of this grant, we have studied the properties of different stochastic models of propagation of cracks. We have taken a particular interest in stochastic models involving the physical Paris law of propagation. We have proposed a new model of PDMP based on the Paris law. This model will allow the change of regimes in the propagation when for instance the crack length reach a given threshold.
This work gave rise to the technical report "RAF1897/07" "Modèles stochastiques pour la propagations de fissures" .
B. Bercu, P. Del Moral and Aurélie Le Cain are starting a research project with the CEA CESTA on the statistical modeling of electromagnetic fields associated to laser beam superpositions (with a total amount of 8K euros and the CEA CESTA PhD-grant funding of Aurélie Le Cain).
P. Del Moral is starting a research project with P. Minvielle of the CEA CESTA on inverse problems in Radar SRA/ISAR imagery processing. More details can be found in the master level projectand the PhD. proposalwebsites. This PhD. proposal is funded by the CEA CESTA.
Jérome Saracco is the leader of a research project financed by the Région Aquitaine for three years (2007-2009), named Estimation recursive pour des modèles semiparamétriques en Statistiquewith a total amount of 120 000 euros including the PhD-grant of Thi Mong Ngoc Nguyen.
P. Del Moral has obtained a Projet Région Aquitaine 2008, volet Recherche named "Méthodes de Monte Carlo par Chaînes de Markov en Interaction" with a total amount of 74K euros including a PhD-grant and a Post-doc grant. P. Del Moral is a member or the scientific leader in Bordeaux of 5 ANR projects:
ANR SYSCOMM Viroscopyon epidemic propagations models and analysis (2009-2012).
ANR PREVASSEMBLEon Forecasting and Data assimilation (2009-2012).
ANR MODECOLon virtual prairie models (2009-2012).
ANR NEBBIANOon the Security and Reliability in Digital Watermarking (2007-2010).
P. Del Moral is also at the origin and he is a member the ANR project Probabilité et Interaction. More details can be found in the website of Chaire d'excellence de P. Diaconis, the program is running from 2006 to 2009.
P. Del Moral belongs the the INRIA Cooperative research action INRIA ARC RARE 2007.
Since 2007, P. Del Moral has a joint research project INRIA-INRA on critical bacteriology proliferation studies with J.P. Vila in the INRA Montpellier.
P. Del Moral is the leader of an INRIA associated team project, named 2AS, with the team of Professor Wu Liming in Wuhan University in China. More details can be found in the website archive.
B. Bercu belongs to the MAS thematic group of the SMAI.
B. Bercu is a member of the UFR council as well as a member of IMB council of the University of Bordeaux 1.
B. Bercu is the director of applied mathematic department of the University of Bordeaux 1.
M. Chavent is webmaster of the web site of the SFDS (Société Française de Statistique).
M. Chavent is a member of the scientific commitee of the following conferences EGC' 08.
M. Chavent is a member of the administration council of the SFDS (Société Française de Statistique).
P. Del Moral is an elected member of the International Statistical Institute since 2006.
P. Del Moral was in the MAS thematic group of the SMAI (industrial and applied mathematical society) from 2005 to 2008.
P. Del Moral is a member of the SMAI since 2005.
P. Del Moral is chief editor of the journal : ESAIM: Proceedingssince 2006.
P. Del Moral is an associate editor of the journal : Stochastic Processes and their Applicationssince 2006.
P. Del Moral is an associate editor of the journal : Stochastic Analysis and Applicationssince 2001.
P. Del Moral and Nicolas Hadjiconstantinou (MIT) are the Guest Editors M2AN for a 2010 special Volume on Probabilistic Methods.
F. Dufour is a member of the SMAI since 2007.
F. Dufour is the head of the second year cursus of the engineering school MATMECA.
F. Dufour is a member of the administration council of the engineering school MATMECA.
A. Gégout-Petit is in charge to promote the "Licence MASS" (Applied mathematics degree) of the University of Bordeaux 2 to the secondary school pupils.
A. Gégout-Petit was member of the Selection Committee (MCF 0982) for the university Victor Segalen 2009.
A. Gégout-Petit is reviewer for Journal of Statistical Planning Inference, Journal of theoretical Biology, Theoretical Biology and Medical Modelling.
A. Gégout-Petit was expert for the ANR project SYSCOMM.
B. de Saporta belongs to the MAS thematic group of the SMAI (industrial and applied mathematical society). The main purpose of this group is to promote probability and statistics in the applied mathematic community.
B. de Saporta was a member of the Commission de Spécialistes, section 26, of the University of Bordeaux 1 in 2008.
B. de Saporta is a member of the organization commitee of Enigmath 2008, a free mathematical quizz on line for all public.
http://
B. de Saporta is a regular reviewer for Mathematical Reviews.
J. Saracco is a member of the administration council of the SFC (Société Francophone de Classification).
J. Saracco is a member of the administration council of the University of Bordeaux 4.
J. Saracco is a reviewer for Computational Statistics and Data Analysis, Journal of Multivariate Analysis, The Annals of Statistics, Statistica Sinica, Biometrika, Comptes Rendus de l'Académie des Sciences.
The team CQFD is very involved in the organization of the "41e Journées de Statistique"Annual event of the "Société Française de Statistique" which will take place in Bordeaux in 2009.
B. Bercu is the organizer of the "Fifth Meeting of Mathematical Statistics"between the Universities of Bordeaux, Montpellier, Santander, Valladolid, Pau, and Toulouse in June 2009.
P. Del Moral is the organizer of the first " Journées de Probabilités et Statistique de Bordeaux" with the CEA CESTAin october 2008. More information on this workshop is available via the web-page :
http://
B. Bercu gave contributed talks at
Workshop on limit theorems, Paris, January 2008.
Alea meeting, Marseille-Luminy, March 2008.
Première Rencontre des Instituts de Mathématiques de Bordeaux, Montpellier, Pau, Toulouse, September 2008.
47e IEEE CDC Conference, Cancun, Mexico, December 2008.
Colloque CISPA Statistique des processus et Applications, Constantine, Algeria, October 2009.
F. Caron gave a contributed talk at the International Conference on Machine Learning, Helsinki, Finland, July 2008.
M. Chavent gave contributed talks at
AGROSTAT 2008 (10èmes journées Européennes Agro-industrie et Méthodes statistiques), Louvain-la-neuve, Belgium, January 2008.
40èmes journées de Statistique, Ottawa, June 2008.
M. Chavent gave an invited talk at COMPSTAT 08 (18th International Conference on Computational Statistics), Porto, Portugal, August 2008.
In 2008, P. Del Moral has been invited to give a series of tutorials in four national and international research programs :
Rencontres GdR MASCOT-NUM, CEA Cadarache (March 12-14 2008). ( 3h lectures, slides to download)
Computational Mathematics, Numerical methods in molecular simulation Bonn (April 7-11th 2008). ( 3h lectures, slides to download)
Opening workshop, Sequential Monte Carlo Methods, SAMSI, Duke Univ. (September 7-10th 2008). ( 2h lectures, slides to download)
10th Machine Learning summer school, Ile de Ré (September 1-15th 2008). ( 6h lectures, slides to download)
In 2008, P. Del Moral has been invited to give a series of lectures in a dozen of International Conferences and in a national colloquium :
Colloquium du MAP5, Universite Paris Descartes, (March 28th, 2008). ( slides to download)
Rencontres EDP/Probasà l'Institut Henri Poincaré, Paris (March 2008) ( slides to download)
Conference on Monte Carlo methods Theory and applications,Brown University (April 25-26, 2008). ( slides to download)
Workshop on Probability and Statistics, Wuhan University (May 30, 2008). ( slides to download)
Workshop on Probability and Statistics, Capital Normal University, Beijing (June 2nd 2008). ( slides to download)
Colloque : Modélisation pour les Ressources Naturelles, INRA Montpellier (June 18-20 2008). ( slides to download)
Analysis and Probability in Nice, Nice (June 23-28 2008). ( slides to download)
International Workshop: Credit Risk, Evry (June 25-27 2008). ( slides to download)
SIAM Conference on Nonlinear Waves and Coherent Structures, session Mathematical modeling of optical communication systems, Universita di Roma La Sapienza (July 21-24 2008).
Workshop on Numerics and Stochastics, Helsinki University (August 25-29 2008). ( slides to download)
Stochastic Analysis Seminar, Oxford University (December 1st, 2008).
Stochastic Analysis Seminar, Warwick University (December 3rd, 2008).
F. Dufour gave a contributed talk at the - 16, Avignon, October 2008.
F. Dufour gave an invited talk to the Monash-Ritsumeikan Symposium on Probability and Related Fields, Melbourne (December 2008).
F. Dufour has been invited two weeks at Escola Politecnica, Universidade de Sao Paulo, Brasil (June, 2008).
F. Dufour has been invited two weeks at the Centre for Modelling of Stochastic Systems, Department of Mathematical Sciences, Monash University, Melbourne, Australia (December 2008).
A. Gégout-Petit gave contributed talks at
"40e journées de Statistique", Ottawa, May 2008.
"Journées MAS", Rennes, August 2008.
B. de Saporta organized a session of talks for Jounées MAS 2008 in Rennes in august.
M. Saracco gave contributed talks at
AGROSTAT 2008 (10èmes journées Européennes Agro-industrie et Méthodes statistiques), Louvain-la-neuve, Belgium, January 2008.
40èmes journées de Statistique, Ottawa, June 2008.
COMPSTAT 08 (18th International Conference on Computational Statistics), Porto, Portugal, August 2008.
M. Chavent, A. Gégout-Petit, B. de Saporta, F. Dufour, H. Zhang, P. Del Moral, Y. Dutuit, B. Bercu and J. Saracco teach graduate probability and Statistics in the cursus "Statistique et Fiabilité" (Statistic and Reliability) of the Master "Ingénierie Mathématique Statistique et Economique" at the Universities of Bordeaux 1, 2 and 4 .
M. Chavent and A. Gégout-Petit teach Statistic in licence MASS of University of Bordeaux 2.
A. Gégout-Petit is an academic tutor for students work experiences in company or Research Organism.
B. de Saporta teaches undergraduate mathematics and postgraduate probability and finance at the university of economics Montesquieu Bordeaux 4.
J. Saracco teaches statistics in Magistère d'Economie et de Finance Internationale (MAGEFI), University Bordeaux 4 and linear algebra and Statistics in Licence of Economics, University Bordeaux 4.