

Section: New Results

Modern methods of data analysis

Participants: H. Cardot, P. Cénac, O. Collignon, J-M. Monnez, P. Vallois.

In 2011, our contributions to data analysis in a biological context were twofold:

  • At a theoretical level, we have continued our work on online data analysis, mentioned in the Scientific Foundations section. Specifically, we have continued developing a fast recursive algorithm for clustering large data sets with the k-medians method.

  • At a practical level, our efforts have focused on a study of peanut allergy, in which our expertise in data analysis allows a good prediction of allergy severity by rigorous methods.

Let us now describe these articles in more detail:

(i) A fast and recursive algorithm for clustering large data sets with k-medians. Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen [56], who introduced a sequential version of the k-means algorithm, a new class of recursive stochastic gradient algorithms designed for the k-medians loss criterion is proposed in [16], [17]. Because they are recursive, these algorithms are very fast and well suited to large samples of data that arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying criterion. Particular attention is paid to the averaged versions, which are known to perform better, and a data-driven procedure is proposed that automatically selects the value of the descent step. The performance of the averaged sequential estimator is compared in a simulation study, in terms of both computation speed and estimation accuracy, with more classical partitioning techniques such as k-means, trimmed k-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated by determining television audience profiles from a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.
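To make the recursion concrete, here is a minimal Python sketch of an averaged stochastic gradient update for the k-medians criterion. It is an illustration under stated assumptions, not the implementation of [16], [17]: the function names, the step-size schedule c_gamma / n^alpha, the initial centers and the simulated two-cluster data are all hypothetical choices, and the data-driven selection of the descent step is not reproduced.

```python
import math
import random


def online_kmedians(stream, init_centers, c_gamma=1.0, alpha=0.75):
    """Averaged stochastic gradient sketch for the k-medians criterion.

    The step size c_gamma / n**alpha is an assumed schedule for illustration.
    """
    centers = [list(c) for c in init_centers]    # current iterates
    averaged = [list(c) for c in init_centers]   # running averages of the iterates
    counts = [0] * len(centers)                  # observations assigned to each center

    for z in stream:
        # Assign the new observation to its nearest current center.
        dists = [math.dist(z, c) for c in centers]
        r = min(range(len(centers)), key=dists.__getitem__)
        d = dists[r]
        counts[r] += 1
        if d > 0.0:
            # The gradient of ||z - c|| in c is -(z - c) / ||z - c||, so the
            # winning center moves toward z by a step of length `step`.
            step = c_gamma / counts[r] ** alpha
            centers[r] = [c + step * (zj - c) / d for c, zj in zip(centers[r], z)]
        # Fold the updated iterate into the running average of that center.
        n = counts[r]
        averaged[r] = [a + (c - a) / (n + 1) for a, c in zip(averaged[r], centers[r])]
    return centers, averaged


def demo(seed=0, n=4000):
    """Run the sketch on simulated data: two Gaussian clusters in the plane."""
    rng = random.Random(seed)
    true_centers = [(0.0, 0.0), (5.0, 5.0)]
    data = []
    for _ in range(n):
        cx, cy = rng.choice(true_centers)
        data.append((cx + rng.gauss(0.0, 1.0), cy + rng.gauss(0.0, 1.0)))
    init = [(1.0, 1.0), (4.0, 4.0)]  # hypothetical starting points
    return online_kmedians(data, init)


centers, averaged = demo()
```

Each observation is read once and touches only its nearest center, so the cost per observation is proportional to the number of centers times the dimension. This is what makes such a scheme suitable for data arriving sequentially.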

(ii) Discriminant analyses of peanut allergy severity scores. Peanut allergy is one of the most prevalent food allergies. The possibility of a lethal accidental exposure and the persistence of the disease make it a public health problem. The intensity of symptoms is evaluated with a double-blind placebo-controlled food challenge (DBPCFC), which scores the severity of reactions and measures the dose of peanut that elicits the first reaction. Since DBPCFC can result in life-threatening responses, we propose in [2] an alternative procedure with the long-term goal of replacing invasive allergy tests. Discriminant analyses of the DBPCFC score, the eliciting dose and the first accidental exposure score were performed in 76 allergic patients using 6 immunoassays and 28 skin prick tests. A multiple factorial analysis was performed to assign equal weights to both groups of variables, and predictive models were built by cross-validation with linear discriminant analysis, k-nearest neighbors, classification and regression trees, penalized support vector machines, stepwise logistic regression and AdaBoost methods. We developed an algorithm for simultaneously clustering eliciting doses and selecting discriminant variables. Our main conclusion is that antibody measurements are informative about allergy severity, especially those directed against rAra-h1 and rAra-h3. Further independent validation of these results and the use of new predictors will help extend this study to clinical practice.
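As an illustration of how a predictive model can be tuned by cross-validation, the following Python sketch selects the number of neighbors of a k-nearest-neighbors classifier by leave-one-out cross-validation. It is a toy example under stated assumptions, not the procedure of [2]: the leave-one-out scheme, the function names and the two-variable synthetic data set are hypothetical and unrelated to the patient measurements.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def loo_accuracy(xs, ys, k):
    """Leave-one-out cross-validated accuracy of a k-NN classifier."""
    hits = 0
    for i in range(len(xs)):
        # Hold out observation i and predict it from all the others.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(train_x, train_y, xs[i], k) == ys[i]
    return hits / len(xs)


def make_toy_data(n=60, seed=1):
    """Synthetic two-class data (purely illustrative, not patient data)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        xs.append((rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)))
        ys.append(label)
    return xs, ys


xs, ys = make_toy_data()
# Odd values of k avoid ties in the two-class majority vote.
scores = {k: loo_accuracy(xs, ys, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
```

Because each held-out observation never takes part in its own prediction, the resulting accuracy estimates how the classifier would behave on new cases. This matters when only a small sample is available.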