MAASAI - 2025

2025 Activity report, Project-Team MAASAI

RNSR: 202023544J
  • Research center: Inria Centre at Université Côte d'Azur
  • In partnership with: Université Côte d'Azur
  • Team name: Models and Algorithms for Artificial Intelligence
  • In collaboration with: Laboratoire informatique, signaux systèmes de Sophia Antipolis (I3S), Laboratoire Jean-Alexandre Dieudonné (JAD)

Creation​ of the Project-Team: 2020​‌ February 01


Keywords​‌

Computer Science and Digital​​ Science

  • A3.1. Data
  • A3.1.10.​​​‌ Heterogeneous data
  • A3.1.11. Structured​ data
  • A3.4. Machine learning​‌ and statistics
  • A9. Artificial​​ intelligence
  • A9.2. Machine learning​​​‌
  • A9.2.1. Supervised learning
  • A9.2.2.​ Unsupervised learning
  • A9.2.6. Neural​‌ networks
  • A9.2.7. Kernel methods​​
  • A9.2.8. Deep learning

Other​​​‌ Research Topics and Application​ Domains

  • B3.6. Ecology
  • B3.6.1.​‌ Biodiversity
  • B6.3.4. Social Networks​​
  • B7.2.1. Smart vehicles
  • B8.2.​​​‌ Connected city
  • B9.6. Humanities​

1 Team members, visitors,​‌ external collaborators

Research Scientists​​

  • Pierre-Alexandre Mattei [INRIA​​​‌, Researcher]
  • Remy​ Sun [INRIA,​‌ ISFP]

Faculty Members​​

  • Charles Bouveyron [Team​​ leader, INRIA &​​​‌ UNIV COTE D'AZUR,‌ Professor, HDR]‌​‌
  • Marco Corneli [UNIV​​ COTE D'AZUR, Associate​​​‌ Professor]
  • Arnaud Droit‌ [INRIA, Professor‌​‌, from Feb 2025​​ until Mar 2025]​​​‌
  • Frederic Precioso [UNIV‌ COTE D'AZUR, Professor‌​‌]
  • Michel Riveill [​​UNIV COTE D'AZUR,​​​‌ Professor]
  • Vincent Vandewalle‌ [UNIV COTE D'AZUR‌​‌, Professor, HDR​​]

PhD Students

  • Davide​​​‌ Adamo [CNRS]‌
  • Elisa Ancarani [UNIV‌​‌ COTE D'AZUR, from​​ Oct 2025]
  • Kilian​​​‌ Burgi [UNIV COTE‌ D'AZUR, until Oct‌​‌ 2025]
  • Célia Dcruz​​ [UNIV COTE D'AZUR​​​‌]
  • Dieu-Donné Fangnon [‌CNRS, from Apr‌​‌ 2025]
  • Mariam Grigoryan​​ [UNIV COTE D'AZUR​​​‌]
  • Maya Guy [‌UNIV COTE D'AZUR,‌​‌ from Mar 2025]​​
  • Nicolas Lacroix [UNIV​​​‌ COTE D'AZUR, from‌ Feb 2025]
  • Seydina‌​‌ Ousmane Niang [UNIV​​ COTE D'AZUR]
  • Raphael​​​‌ Razafindralambo [INRIA,‌ from Feb 2025]‌​‌
  • Raphael Razafindralambo [INRIA​​, until Jan 2025​​​‌]
  • Julie Tores [‌UNIV COTE D'AZUR]‌​‌

Technical Staff

  • Mathieu Lacage​​ [INRIA, Engineer​​​‌, from Jun 2025‌]
  • Axel Velez [‌​‌INRIA, Engineer,​​ from Apr 2025]​​​‌
  • Li Yang [CNRS‌, Engineer, until‌​‌ Jun 2025]

Interns​​ and Apprentices

  • Muhammad Hasham​​​‌ Waseem Abbasi [INRIA‌, Intern, from‌​‌ Apr 2025 until Aug​​ 2025]
  • Bence Zsolt​​​‌ Beregi [UNIV COTE‌ D'AZUR, Intern,‌​‌ from Mar 2025 until​​ Sep 2025]
  • Juliette​​​‌ Girardin [UNIV COTE‌ D'AZUR, Intern,‌​‌ from Dec 2025]​​
  • Juliette Girardin [UNIV​​​‌ COTE D'AZUR, from‌ May 2025 until Sep‌​‌ 2025]
  • Theo Millot​​ [INRIA, Intern​​​‌, from Apr 2025‌ until Sep 2025]‌​‌
  • Vedang Bhupesch Shenvi Nadkarni​​ [INRIA, Intern​​​‌, until May 2025‌]
  • Zhiqiang Yu [‌​‌UNIV COTE D'AZUR,​​ from Jul 2025 until​​​‌ Sep 2025]

Administrative‌ Assistant

  • Claire Senica [‌​‌INRIA]

Visiting Scientists​​

  • Félix Mejia Cajica [​​​‌Univ Santander, until‌ Jan 2025]
  • Federico‌​‌ Raspanti [UNIV COTE​​ D'AZUR]
  • Pradeep Singh​​​‌ [IIIT Delhi,‌ from Aug 2025 until‌​‌ Nov 2025]
  • Sabrina​​ Villata [Univ Torino​​​‌, from Sep 2025‌ until Nov 2025]‌​‌

External Collaborators

  • Alexandre Destere​​ [CHU NICE]​​​‌
  • Pierre Latouche [UNIV‌ CLERMONT AUVERG]
  • Diane‌​‌ Lingrand [UNIV COTE​​ D'AZUR, until Aug​​​‌ 2025]

2 Overall‌ objectives

Artificial intelligence has become a key element in most scientific fields and is now part of everyone's life thanks to the digital revolution. Statistical, machine and deep learning methods are involved in most scientific applications where a decision has to be made, such as medical diagnosis, autonomous vehicles or text analysis. The recent and highly publicized results of artificial intelligence should not hide the remaining and new problems posed by modern data. Indeed, despite the recent improvements due to deep learning, the nature of modern data has brought new specific issues. For instance, learning with high-dimensional, atypical (networks, functions, …), dynamic, or heterogeneous data remains difficult for theoretical and algorithmic reasons. The recent establishment of deep learning has also opened new questions such as: How to learn in an unsupervised or weakly-supervised context with deep architectures? How to design a deep architecture for a given situation? How to learn with evolving and corrupted data?

To address these questions, the Maasai team focuses on topics such as unsupervised learning, theory of deep learning, adaptive and robust learning, and learning with high-dimensional or heterogeneous data. The Maasai team conducts research that links practical problems, which may come from industry or other scientific fields, with the theoretical aspects of Mathematics and Computer Science. In this spirit, the Maasai project-team is fully aligned with the “Core elements of AI” axis of the Institut 3IA Côte d’Azur. It is worth noting that the team hosts three 3IA chairs of the Institut 3IA Côte d’Azur, as well as several PhD students funded by the Institut.

3 Research program​​​‌

Within the research strategy​ explained above, the Maasai​‌ project-team aims at developing​​ statistical, machine and deep​​​‌ learning methodologies and algorithms​ to address the following​‌ four axes.

Unsupervised learning​​

The first research axis is about the development of models and algorithms designed for unsupervised learning with modern data. Let us recall that unsupervised learning (the task of learning without annotations) is one of the most challenging learning tasks. Indeed, although powerful supervised learning methods have emerged in the last decade, their requirement for huge annotated data sets remains an obstacle to their extension to new domains. In addition, the nature of modern data significantly differs from usual quantitative or categorical data. In this axis, we aim to propose models and methods explicitly designed for unsupervised learning on data such as high-dimensional, functional, dynamic or network data. All these types of data are massively available nowadays in everyday life (omics data, smart cities, ...) and they unfortunately remain difficult to handle efficiently for theoretical and algorithmic reasons. The dynamic nature of the studied phenomena is also a key point in the design of reliable algorithms.

On the one hand, we direct our efforts towards the development of unsupervised learning methods (clustering, dimension reduction) designed for specific data types: high-dimensional, functional, dynamic, text or network data. Indeed, even though those kinds of data are more and more present in every scientific and industrial domain, there is a lack of sound models and algorithms to learn in an unsupervised context from such data. To this end, we have to face problems that are specific to each data type: How to overcome the curse of dimensionality for high-dimensional data? How to handle multivariate functional data / time series? How to handle the activity length of dynamic networks? On the basis of our recent results, we aim to develop generative models for such situations, allowing the modeling of such modern data and unsupervised learning from them.

On the other hand, we focus on deep generative models (statistical models based on neural networks) for clustering and semi-supervised classification. Neural network approaches have demonstrated their efficiency in many supervised learning situations and it is of great interest to be able to use them in unsupervised situations. Unfortunately, the transfer of neural network approaches to the unsupervised context is made difficult by the huge number of model parameters to fit and the absence of an objective quantity to optimize in this case. We therefore study and design model-based deep learning methods that can handle unsupervised or semi-supervised problems in a statistically grounded way.

Finally, we also aim at developing explainable unsupervised models that ease the interaction with practitioners and their understanding of the results. There is an important need for such models, in particular when working with high-dimensional or text data. Indeed, unsupervised methods, such as clustering or dimension reduction, are widely used in application fields such as medicine, biology or digital humanities. In all these contexts, practitioners need efficient learning methods which can help them make good decisions while understanding the studied phenomenon. To this end, we aim at proposing generative and deep models that encode parsimonious priors, allowing in turn an improved understanding of the results.

Understanding (deep) learning models‌​‌

The second research axis​​ is more theoretical, and​​​‌ aims at improving our‌ understanding of the behavior‌​‌ of modern machine learning​​ models (including, but not​​​‌ limited to, deep neural‌ networks). Although deep learning‌​‌ methods and other complex​​ machine learning models are​​​‌ obviously at the heart‌ of artificial intelligence, they‌​‌ clearly suffer from an​​ overall weak knowledge of​​​‌ their behavior, leading to‌ a general lack of‌​‌ understanding of their properties.​​ These issues are barriers​​​‌ to the wide acceptance‌ of the use of‌​‌ AI in sensitive applications,​​ such as medicine, transportation,​​​‌ or defense. We aim‌ at combining statistical (generative)‌​‌ models with deep learning​​ algorithms to justify existing​​​‌ results, and allow a‌ better understanding of their‌​‌ performances and their limitations.​​

We particularly focus on researching ways to understand, interpret, and possibly explain the predictions of modern, complex machine learning models. We aim both at studying the empirical and theoretical properties of existing techniques (like the popular LIME), and at developing new frameworks for interpretable machine learning (for example based on deconvolutions or generative models). Among the relevant application domains in this context, we focus notably on text and biological data.

Another question of interest is: what are the statistical properties of deep learning models and algorithms? Our goal is to provide a statistical perspective on the architectures, algorithms, loss functions and heuristics used in deep learning. Such a perspective can reveal potential issues in existing deep learning techniques, such as biases or miscalibration. Consequently, we are also interested in developing statistically principled deep learning architectures and algorithms, which can be particularly useful in situations where limited supervision is available, and when accurate modeling of uncertainties is desirable.

Adaptive and Robust​​​‌ Learning

The third research​ axis aims at designing​‌ new learning algorithms which​​ can learn incrementally, adapt​​​‌ to new data and/or​ new context, while providing​‌ predictions robust to biases​​ even if the training​​​‌ set is small.

For instance, we have designed an innovative method of so-called cumulative learning, which makes it possible to learn a convolutional representation of data when the learning set is (very) small. The idea is to extend Transfer Learning: instead of training a model on one domain and transferring it once to another domain (possibly with a fine-tuning phase), the process is repeated for as many domains as are available. We have evaluated our method on mass spectrometry data for cancer detection. The difficulty of acquiring spectra makes it impossible to produce volumes of data large enough to benefit from the power of deep learning. Thanks to cumulative learning, small numbers of spectra acquired for different types of cancer, on different organs of different species, together contribute to the learning of a deep representation that yields unequalled results from the available data on the detection of the targeted cancers. This extension of the well-known Transfer Learning technique can be applied to any kind of data.

We also investigate active learning techniques. For example, we have proposed an active learning method for deep networks based on adversarial attacks. An unlabeled sample which becomes an adversarial example under the smallest perturbations is selected as a good candidate by our active learning strategy. This not only makes it possible to train the network incrementally but also makes it robust to the attacks chosen for the active learning process.
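
As a minimal sketch of this selection rule, consider a linear binary classifier, for which the smallest perturbation that flips a prediction has a closed form (the distance to the decision hyperplane). For deep networks this distance would have to be approximated with an adversarial attack. The function names and the toy pool below are hypothetical and only illustrate the principle; they are not the team's implementation.

```python
import math

def min_adversarial_norm(w, b, x):
    """L2 norm of the smallest perturbation that moves x onto the decision
    boundary of the linear classifier sign(w.x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

def select_query(w, b, unlabeled):
    """Index of the unlabeled sample that needs the smallest perturbation
    to change class, i.e. the candidate to send for annotation."""
    return min(range(len(unlabeled)),
               key=lambda i: min_adversarial_norm(w, b, unlabeled[i]))

# Hypothetical classifier and unlabeled pool.
w, b = [3.0, 4.0], -5.0
pool = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]]
query = select_query(w, b, pool)  # sample closest to the boundary
```

In an active learning loop, the selected sample would be labeled, added to the training set, and the classifier retrained before the next query.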

Finally,​​​‌ we address the problem​ of biases for deep​‌ networks by combining domain​​ adaptation approaches with Out-Of-Distribution​​​‌ detection techniques.

Learning with​ heterogeneous and corrupted data​‌

The last research axis is devoted to making machine learning models more suitable for real-world, "dirty" data. Real-world data rarely consist of a single kind of Euclidean features, and are generally heterogeneous. Moreover, it is common to find some form of corruption in real-world data sets: for example missing values, outliers, label noise, or even adversarial examples.

Heterogeneous and non-Euclidean data are indeed part of the most important and sensitive applications of artificial intelligence. As a concrete example, in medicine, the data recorded on a patient in a hospital range from images to functional data and networks. It is obviously of great interest to be able to account for all data available on the patients to propose a diagnosis and an appropriate treatment. Note that this also applies to autonomous cars, digital humanities and biology. Proposing unified models for heterogeneous data is an ambitious task, but first attempts at combining two data types have shown that more general models are feasible and significantly improve performance. We also address the problem of reconciling structured and unstructured data, as well as data of different levels (individual and contextual data).

On the basis of our previous works (notably on the modeling of networks and texts), we first intend to continue to propose generative models for (at least two) different types of data. Among the target data types for which we would like to propose generative models, we can cite images and biological data, networks and images, images and texts, and texts and ordinal data. To this end, we explore modeling approaches based on common latent spaces or on the hybridization of several generative models within a global framework. We are also interested in including potential corruption processes in these heterogeneous generative models. For example, we are developing new models that can handle missing values, under various sorts of missingness assumptions.

Besides the modeling‌​‌ point of view, we​​ are also interested in​​​‌ making existing algorithms and‌ implementations more fit for‌​‌ "dirty data". We study​​ in particular ways to​​​‌ robustify algorithms, or to‌ improve heuristics that handle‌​‌ missing/corrupted values or non-Euclidean​​ features.

4 Application domains​​​‌

Although the team members conduct theoretical research in statistical and machine learning, they are committed to applying their results to solve concrete problems in the following areas:

Medicine

Most team members apply their research work to Medicine or extract theoretical AI problems from medical situations. In particular, our main applications to Medicine are focused on pharmacovigilance, medical imaging, and omics. It is worth noting that medical applications cover all research axes of the team due to the high diversity of data types and AI questions.

Digital humanities‌​‌

Another important application field for Maasai is the increasingly dynamic one of digital humanities. It is an extremely motivating field due to the very original questions that are addressed. Indeed, archeologists and historians have questions that are quite different from the usual ones in AI. This allows the team to formalize original AI problems that can be generalized to other fields, which in turn contributes indirectly to the general theory and methodology of AI. Furthermore, the team maintains an active collaboration with researchers on the study and detection of issues related to social justice (e.g. objectification) in movies. It is worth mentioning that Marco Corneli has a double appointment in Mathematics (Maasai) and Archeology (CEPAM).

Astrophysics and​​ physical sciences

Maasai is actively interested in the emerging applications of deep and statistical learning to the fields of Astrophysics and Physical sciences. The phenomena studied in these fields follow a number of very particular physical laws and differential equations that offer new constraints and methods to explore. As such, the team participates in a number of initiatives in collaboration with both the Observatoire de la Côte d'Azur and other Inria teams specialized in scientific computation.

Other application domains

 Other​​​‌ topics of interest of​ the team include autonomous​‌ vehicles, bioinformatics, multimedia and​​ ecology.

5 Highlights of​​​‌ the year

  • The 3IA​ chair of Pierre-Alexandre Mattei​‌ has been renewed by​​ the international jury of​​​‌ Institut 3IA Côte d'Azur​ for an additional 4​‌ years.
  • Maasai was associated​​ with two successful BPI​​​‌ projects: Logie IA and​ CO2PILOTE.
  • Charles Bouveyron has​‌ delivered a keynote talk​​ at the 7th International​​​‌ Conference on Statistics: Theory​ and Applications, Paris,​‌ France, in August 2025.​​
  • Pierre-Alexandre Mattei has delivered​​​‌ a keynote talk at​ the 56th Journées de​‌ Statistique of the French​​ Statistical Society, in Marseille,​​​‌ in June 2025.

6​ Latest software developments, platforms,​‌ open data

6.1 Latest​​ software developments

6.1.1 Indago​​​‌ - Web Interface

  • Name:​
    Indago - Web Interface​‌
  • Keywords:
    Clustering, Cluster, Clusters,​​ Artificial intelligence, Unsupervised learning,​​​‌ Graph, Directed graphs, Graph​ algorithmics, Graph summaries, Graph​‌ processing, Statistical learning, Statistics,​​ Data visualization, Graph visualization,​​​‌ Visualization, Information visualization
  • Scientific​ Description:
    Indago implements a textual graph clustering method based on a joint analysis of the graph structure and the content exchanged between nodes. This yields a better segmentation than what could be obtained with traditional methods. Indago's main applications are built around communication network analysis, including social networks. However, Indago can be applied to any graph-structured textual network. Indago has thus been tested on various data, such as tweet corpora, mail networks, scientific paper co-publication networks, etc.
  • Functional Description:
    Visualization platform,​‌ which, when used along​​ with the Indago processing​​​‌ module, gives a tool​ for static clustering of​‌ a network with textual​​ edges based on a​​​‌ joint analysis of the​ network structure and the​‌ content of the communications​​
  • Contact:
    Charles Bouveyron
  • Participant:​​​‌
    3 anonymous participants

6.1.2​ Indago - Processing Module​‌

  • Name:
    Indago - Processing​​ Module
  • Keywords:
    Clustering, Cluster,​​​‌ Clusters, Unsupervised learning, Graph,​ Directed graphs, Graph algorithmics,​‌ Graph summaries, Graph processing,​​ Statistical learning, Statistics
  • Scientific​​​‌ Description:
    Indago implements a textual graph clustering method based on a joint analysis of the graph structure and the content exchanged between nodes. This yields a better segmentation than what could be obtained with traditional methods. Indago's main applications are built around communication network analysis, including social networks. However, Indago can be applied to any graph-structured textual network. Indago has thus been tested on various data, such as tweet corpora, mail networks, scientific paper co-publication networks, etc.
  • Functional Description:
    Static clustering​ of a network with​‌ textual edges based on​​ a joint analysis of​​​‌ the network structure and​ the content of the​‌ communications
  • Contact:
    Charles Bouveyron​​
  • Participant:
    3 anonymous participants​​​‌

6.1.3 SemiPy

  • Name:
    SemiPy:​ Deep semi-supervised with Python​‌
  • Keywords:
    Machine learning, Semi-supervised​​ classification, Deep learning, Pytorch,​​​‌ Python
  • Scientific Description:
    This Python library makes it possible to train differentiable machine learning models (such as deep neural networks) in a semi-supervised way (i.e. with a dataset that is only partially labelled). It is based on the PyTorch deep learning library. In particular, pseudo-label methods based on data augmentation (such as FixMatch) are implemented.
  • Functional Description:
    Train‌ differentiable machine learning models‌​‌ (such as deep neural​​ networks) in a semi-supervised​​​‌ way.
  • URL:
  • Contact:‌
    Pierre-Alexandre Mattei
  • Partner:
    Naval‌​‌ Group
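
To make the idea of augmentation-based pseudo-labelling concrete, here is a minimal pure-Python sketch of a FixMatch-style loss on unlabeled data. It does not use or reproduce the SemiPy API: the function name, the default threshold and the toy probabilities are assumptions chosen for illustration, and a real implementation would operate on PyTorch tensors.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Pseudo-label loss on a non-empty batch of unlabeled samples.

    weak_probs[i] and strong_probs[i] are the class probabilities predicted
    for a weakly and a strongly augmented view of sample i (assumed > 0).
    """
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence < threshold:
            continue  # prediction too uncertain: no pseudo-label for this sample
        pseudo_label = p_weak.index(confidence)
        # cross-entropy between the hard pseudo-label and the strong-view prediction
        total += -math.log(p_strong[pseudo_label])
    # masked samples still count in the denominator
    return total / len(weak_probs)

# Hypothetical batch: only the first sample is confident enough.
weak = [[0.98, 0.02], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.90, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

This term is typically added, with a weighting coefficient, to the usual supervised cross-entropy computed on the labelled part of the batch.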

6.1.4 RO3SE

  • Name:​​
    Robust Semi-Supervised Speech Enhancement​​​‌
  • Keywords:
    Deep learning, Speech‌ processing, Unsupervised learning, Data‌​‌ augmentation
  • Functional Description:
    This​​ software aims at improving​​​‌ the performance of speech‌ enhancement algorithms by leveraging‌​‌ unsupervised/unlabeled data that correspond​​ to real-life audio scenarios.​​​‌ It is composed of‌ data augmentation methods, deep‌​‌ neural network architectures, semi-supervised​​ training methods, loss functions,​​​‌ and evaluation tools to‌ develop and evaluate deep-learning-based‌​‌ semi-supervised speech enhancement algorithms.​​ The code is both​​​‌ documented and unit-tested.
  • Contact:‌
    Pierre-Alexandre Mattei
  • Participant:
    3‌​‌ anonymous participants
  • Partner:
    PULSE​​ AUDITION

6.1.5 StreamETM

  • Keywords:​​​‌
    Machine learning, Artificial intelligence,‌ Natural language processing
  • Scientific‌​‌ Description:

    StreamETM is an​​ application designed for dynamic​​​‌ topic modeling using the‌ Embedded Topic Model (ETM).‌​‌ It processes streaming text​​ data, merges topic models​​​‌ over time, and detects‌ change points in topic‌​‌ distributions.

    Features: 1) Dynamic​​ Topic Modeling: Continuously update​​​‌ topic models with new‌ data chunks. 2) Topic‌​‌ Merging: Merge new topic​​ models with existing ones​​​‌ to maintain a coherent‌ topic structure. 3) Change‌​‌ Point Detection: Detect significant​​ changes in topic distributions​​​‌ over time using Online‌ Change Point Detection (OCPD).‌​‌ 4) Preprocessing: Preprocess text​​ data, including lemmatization, stopword​​​‌ removal, and frequency-based filtering.‌

  • Functional Description:
    Online topic‌​‌ modeling using the Embedded​​ Topic Model (ETM) to​​​‌ process streaming text data,‌ merge topic models over‌​‌ time, and detect change​​ points in topic distributions.​​​‌
  • News of the Year:‌
    The software has been‌​‌ developed as part of​​ the paper, Merging Embedded​​​‌ Topics with Optimal Transport‌ for Online Topic Modeling‌​‌ on Data Streams (https://arxiv.org/pdf/2504.07711)​​
  • Publication:
  • Contact:
    Federica​​​‌ Granese
  • Participant:
    4 anonymous‌ participants

6.1.6 HERACLES

  • Name:‌​‌
    Online Topic Change Point​​ Detection
  • Keywords:
    Incremental clustering,​​​‌ Continual Learning
  • Functional Description:‌
    This script performs topic‌​‌ modeling on chunks of​​ documents using BERTopic and​​​‌ detects change points in‌ topic distributions over time‌​‌ using Online Change Point​​ Detection (OCPD). It processes​​​‌ document chunks iteratively, updating‌ the topic model with‌​‌ each new chunk, and​​ identifies significant changes in​​​‌ topics over time.
  • Contact:‌
    Serena Villata

6.1.7 HDClassif‌​‌

  • Name:
    High Dimensional Supervised​​ Classification and Clustering
  • Keywords:​​​‌
    Classification, Statistical methods
  • Functional‌ Description:
    The HDclassif package‌​‌ is devoted to the​​ clustering and the discriminant​​​‌ analysis of high-dimensional data.‌ The classification methods proposed‌​‌ in the package result​​ from a new parametrization​​​‌ of the Gaussian mixture‌ model which combines the‌​‌ idea of dimension reduction​​ and model constraints on​​​‌ the covariance matrices. The‌ supervised classification method using‌​‌ this parametrization has been​​ called High Dimensional Discriminant​​​‌ Analysis (HDDA). In a‌ similar manner, the associated‌​‌ clustering method has been​​ called High Dimensional Data​​​‌ Clustering (HDDC) and uses‌ the Expectation-Maximization (EM) algorithm‌​‌ for inference. In order​​ to correctly fit the​​​‌ data, both methods estimate‌ the specific subspace and‌​‌ the intrinsic dimension of​​ the groups. Due to​​​‌ the constraints on the‌ covariance matrices, the number‌​‌ of parameters to estimate​​​‌ is significantly lower than​ other model-based methods and​‌ this allows the methods​​ to be stable and​​​‌ efficient in high-dimensional spaces.​ Experiments on artificial and​‌ real datasets show that​​ HDDC and HDDA perform​​​‌ better than existing classical​ methods on high-dimensional datasets,​‌ even with small datasets.​​
  • URL:
  • Contact:
    Charles​​​‌ Bouveyron
  • Participant:
    3 anonymous​ participants
  • Partner:
    Université Paris-Descartes​‌
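
To illustrate why the constrained parametrization needs fewer parameters, the sketch below counts free parameters for a Gaussian mixture with full covariance matrices and for the most general subspace model of the HDDC family (one intrinsic dimension, one set of subspace variances and one noise variance per group). The counting formulas are the ones usually given for these models and are not stated in this report, so treat them as an assumption; the function names are hypothetical and unrelated to the HDclassif API.

```python
def full_gmm_params(K, p):
    """Free parameters of a K-component Gaussian mixture in dimension p
    with unconstrained (full) covariance matrices."""
    means_and_proportions = K * p + (K - 1)
    covariances = K * p * (p + 1) // 2
    return means_and_proportions + covariances

def hddc_params(K, p, dims):
    """Free parameters of the general subspace model, where dims[k] is the
    intrinsic dimension of group k."""
    means_and_proportions = K * p + (K - 1)
    # orientation of each group-specific subspace (d orthonormal vectors in R^p)
    orientations = sum(d * (2 * p - d - 1) // 2 for d in dims)
    subspace_variances = sum(dims)  # one variance per retained direction
    noise_variances = K             # one noise variance per group
    intrinsic_dims = K              # the dimensions themselves are estimated
    return (means_and_proportions + orientations
            + subspace_variances + noise_variances + intrinsic_dims)

# Hypothetical setting: 4 groups in dimension 100, intrinsic dimension 10.
n_full = full_gmm_params(4, 100)
n_hddc = hddc_params(4, 100, [10, 10, 10, 10])
```

In this setting the full mixture requires 20603 parameters while the subspace model requires 4231, which is the kind of reduction that keeps estimation stable in high dimension.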

6.2 New platforms

Cassiopy​​

Website: https://pypi.org/project/cassiopy/

Participants: Vincent​​​‌ Vandewalle.

  • Software Family​ : vehicle;
  • Audience:​‌ community;
  • Evolution and​​ maintenance: basic;
  • Free Description: CassioPy is a Python library for clustering using the Skew-t distribution. This model is designed to handle data with skewness, providing more accurate clustering results in many real-world scenarios where data may not follow a normal distribution.

CleverFish

Website:​‌ https://3ia-demos.inria.fr/en/demos/cleverfish/

Participants: Charles Bouveyron​​, Remy Sun,​​​‌ Diane Lingrand.

  • Software​ Family : vehicle;​‌
  • Audience: community;
  • Evolution​​ and maintenance: basic;​​​‌
  • Free Description: CleverFish is a novel tool designed to bridge the gap between marine biology and AI. CleverFish tackles three core challenges of an efficient management tool: providing an easy-to-use graphical user interface, accommodating global and video-specific in-app biodiversity assessment, and allowing fast and efficient extraction of temporal and spatial fish species distribution in a format understandable for ecologists.

7​​ New results

7.1 Unsupervised​​​‌ learning

7.1.1 Deep Latent​ Position Topic Model (LPTM)​‌ for Clustering and Representation​​ of Networks with Textual​​​‌ Edges

Participants: Charles Bouveyron​, Rémi Boutin,​‌ Pierre Latouche.

Keywords:​​ Generative models, Clustering, Networks,​​​‌ Text, Topic modeling

Digital interactions in which users share textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualization of the data is mandatory. To address both issues, we introduced in [14] Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterize the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualization properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM better recovers the partitions of the nodes than the state-of-the-art ETSBM and STBM (see Figure 1). Finally, the emails of the Enron company are analyzed and visualizations of the results are presented, with meaningful highlights of the graph structure.

Figure 1: Illustration of Deep-LPTM main contributions on a synthetic network. Four panels compare the true network (three distinct clusters) with the results of SBM LDA, ETSBM and Deep-LPTM; colors indicate the clusters.

7.1.2​​​‌ A Tutorial on Discriminative‌ Clustering and Mutual Information‌​‌

Participants: Pierre-Alexandre Mattei,​​ Louis Ohl, Frédéric​​​‌ Precioso.

Keywords: Clustering,‌ Deep learning

To cluster‌​‌ data is to separate​​ samples into distinctive groups​​​‌ that should ideally have‌ some cohesive properties. Today,‌​‌ numerous clustering algorithms exist,​​ and their differences lie​​​‌ essentially in what can‌ be perceived as “cohesive‌​‌ properties”. Therefore, hypotheses on​​ the nature of clusters​​​‌ must be set: they‌ can be either generative‌​‌ or discriminative. As the​​ last decade witnessed the​​​‌ impressive growth of deep‌ clustering methods that involve‌​‌ neural networks to handle​​ high-dimensional data often in​​​‌ a discriminative manner, we‌ concentrate mainly on the‌​‌ discriminative hypotheses. In 30​​, our aim is​​​‌ to provide an accessible‌ historical perspective on the‌​‌ evolution of discriminative clustering​​ methods and notably how​​​‌ the nature of assumptions‌ of the discriminative models‌​‌ changed over time: from​​ decision boundaries to invariance​​​‌ critics. We notably highlight‌ how mutual information has‌​‌ been a historical cornerstone​​ of the progress of​​​‌ (deep) discriminative clustering methods.‌ We also show some‌​‌ known limitations of mutual​​ information and how discriminative​​​‌ clustering methods tried to‌ circumvent those. We then‌​‌ discuss the challenges that​​ discriminative clustering faces with​​​‌ respect to the selection‌ of the number of‌​‌ clusters. Finally, we showcase​​ these techniques using the​​​‌ dedicated Python package, GemClus,‌ that we have developed‌​‌ for discriminative clustering.
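As a rough illustration of the mutual-information objective discussed in the tutorial, the sketch below estimates I(X; Y) as the entropy of the average cluster assignment minus the average entropy of the individual assignments. It uses plain NumPy rather than GemClus, and the function name `mutual_information` is hypothetical.

```python
import numpy as np

def mutual_information(probs, eps=1e-12):
    """Estimate I(X; Y) from an (n_samples, n_clusters) matrix of p(y | x_i)."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # empirical cluster proportions p(y)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))
    h_conditional = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return h_marginal - h_conditional

# Confident and balanced assignments: mutual information close to log(2).
confident_balanced = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
# Uninformative assignments: mutual information close to 0.
uninformative = np.full((4, 2), 0.5)

mi_confident = mutual_information(confident_balanced)
mi_uninformative = mutual_information(uninformative)
```

Maximizing this quantity favors assignments that are both confident (low conditional entropy) and balanced (high marginal entropy), which is one way to read the role of mutual information in discriminative clustering.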

7.1.3​​ Clustering by Deep Latent​​​‌ Position Model with Graph‌ Convolutional Network

Participants: Charles‌​‌ Bouveyron, Marco Corneli​​.

Collaborations: Pierre Latouche,​​​‌ Dingge Liang

Keywords: Multiview‌ graphs, Clustering, Variational Autoencoding‌​‌ Inference

With the significant increase of interactions between individuals through digital means, clustering of vertices in graphs has become a fundamental approach for analyzing large and complex networks. In 26, we propose the deep latent position model (DeepLPM), an end-to-end generative clustering approach which combines the widely used latent position model (LPM) for network analysis with a graph convolutional network (GCN) encoding strategy. Moreover, an original estimation algorithm is introduced to integrate the explicit optimization of the posterior clustering probabilities via variational inference and the implicit optimization using stochastic gradient descent for graph reconstruction. Numerical experiments on simulated scenarios highlight the ability of DeepLPM to self-penalize the evidence lower bound for selecting the intrinsic dimension of the latent space and the number of clusters, demonstrating its clustering capabilities compared to state-of-the-art methods. Finally, DeepLPM is further applied to an ecclesiastical network in Merovingian Gaul and to the Cora citation network to illustrate its practical interest for exploring large and complex real-world networks.
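For readers unfamiliar with latent position models, the following sketch shows the classical distance-based form, in which the probability of an edge decreases with the distance between latent positions. It is not the DeepLPM architecture: the logistic link, the intercept `alpha` and the toy positions are assumptions chosen for illustration.

```python
import numpy as np

def lpm_edge_probabilities(z, alpha):
    """Edge probabilities sigmoid(alpha - ||z_i - z_j||) for latent positions z of shape (n, d)."""
    diff = z[:, None, :] - z[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return 1.0 / (1.0 + np.exp(-(alpha - dist)))

rng = np.random.default_rng(0)
# Two well-separated groups of latent positions (toy data, assumed for illustration).
z = np.vstack([rng.normal(-3.0, 0.3, size=(5, 2)),
               rng.normal(3.0, 0.3, size=(5, 2))])
probs = lpm_edge_probabilities(z, alpha=2.0)

# Sample an undirected graph without self-loops from the upper triangle.
adjacency = rng.binomial(1, np.triu(probs, k=1))
adjacency = adjacency + adjacency.T
```

Nodes that are close in the latent space connect with high probability, so clusters of positions translate into densely connected groups of nodes.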

7.1.4​​ The Multiplex Deep Latent​​​‌ Position Model for the‌ Clustering of nodes in‌​‌ Multiview Networks

Participants: Charles​​​‌ Bouveyron, Marco Corneli​.

Collaborations: Pierre Latouche,​‌ Dingge Liang, Junping Yin​​

Keywords: Multiview graphs, Clustering,​​​‌ Variational Autoencoding Inference

Multiplex​ networks capture multiple types​‌ of interactions among the​​ same set of nodes,​​​‌ creating a complex, multi-relational​ framework. A typical example​‌ is a social network​​ where nodes (actors) are​​​‌ connected by various types​ of ties, such as​‌ professional, familial, or social​​ relationships. Clustering nodes in​​​‌ these networks is a​ key challenge in unsupervised​‌ learning, given the increasing​​ prevalence of multiview data​​​‌ across domains. While previous​ research has focused on​‌ extending statistical models to​​ handle such networks, these​​​‌ adaptations often struggle to​ fully capture complex network​‌ structures and rely on​​ computationally intensive Markov chain​​​‌ Monte Carlo (MCMC) for​ inference, rendering them less​‌ feasible for effective network​​ analysis. To overcome these​​​‌ limitations, in 27 we​ propose the multiplex deep​‌ latent position model (MDLPM,​​ see Fig. 2),​​​‌ which generalizes and extends​ latent position models to​‌ multiplex networks. MDLPM combines​​ deep learning with variational​​​‌ inference to effectively tackle​ both the modeling and​‌ computational challenges raised by​​ multiplex networks. Unlike most​​​‌ existing deep learning models​ for graphs that require​‌ external clustering algorithms (e.g.,​​ k-means) to group nodes​​​‌ based on their latent​ embeddings, MDLPM integrates clustering​‌ directly into the learning​​ process, enabling a fully​​​‌ unsupervised, end-to-end approach. This​ integration improves the ability​‌ to uncover and interpret​​ clusters in multiplex networks​​​‌ without relying on external​ procedures. Numerical experiments across​‌ various synthetic data sets​​ and two real-world networks​​​‌ demonstrate the performance of​ MDLPM compared to state-of-the-art​‌ methods, highlighting its applicability​​ and effectiveness for multiplex​​​‌ network analysis.

Figure 2

The image​ depicts a framework for​‌ multiview network analysis. It​​ starts with multiple network​​​‌ views processed by separate​ encoders, generating mean and​‌ variance outputs. These outputs​​ are combined using MLPs​​​‌ (Multilayer Perceptrons) to produce​ latent variables. These latent​‌ variables feed into LPM-based​​ decoders to reconstruct the​​​‌ original network views, facilitating​ the generative modeling of​‌ multiview networks.

Figure 2​​: Architecture of the​​​‌ proposed graph variational auto-encoder.​

7.1.5 Scaling Optimal Transport​‌ to High-Dimensional Gaussian Distributions​​

Participants: Charles Bouveyron,​​​‌ Marco Corneli.

Keywords:​ Optimal transport, High-dimensional Gaussian​‌ distributions, Subspace modeling

Although optimal transport (OT) has recently become very popular in machine learning, it faces challenges when dealing with high-dimensional data, such as images or omics data. Current OT approaches for high-dimensional situations rely on projections of the data or measures onto low-dimensional spaces, which inevitably results in information loss. In 62, we consider the case of high-dimensional Gaussian distributions with parsimonious covariance structures and lower intrinsic dimension. We exhibit a simplified closed-form expression of the 2-Wasserstein ($W_2$) distance with an efficient and robust calculation procedure based on a low-dimensional decomposition of empirical covariance matrices, without relying on data projections. Furthermore, we provide a closed-form expression for the Monge map, which involves the exact calculation of the square root and inverse square root of the source distribution covariance matrix. This approach offers analytical and computational advantages, as demonstrated by our numerical experiments, which quantitatively evaluate these benefits in comparison to existing methods. In addition to computing both the $W_2^2$ distance and the transport map, our method outperforms model-free methods in high dimension, even in the case of non-Gaussian distributions.
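To make the Gaussian setting concrete, here is a minimal sketch of the textbook closed forms for the squared 2-Wasserstein distance and the Monge map between two Gaussians. It is the generic full-covariance formula, not the low-dimensional decomposition proposed in 62; the helper names are hypothetical and the source covariance is assumed invertible.

```python
import numpy as np

def sqrtm_psd(mat):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2_squared(m1, s1, m2, s2):
    """||m1 - m2||^2 + tr(s1 + s2 - 2 (s1^{1/2} s2 s1^{1/2})^{1/2})."""
    root1 = sqrtm_psd(s1)
    cross = sqrtm_psd(root1 @ s2 @ root1)
    return float(np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))

def gaussian_monge_map(m1, s1, m2, s2):
    """Return T(x) = m2 + A (x - m1), with A = s1^{-1/2} (s1^{1/2} s2 s1^{1/2})^{1/2} s1^{-1/2}."""
    root1 = sqrtm_psd(s1)
    inv_root1 = np.linalg.inv(root1)
    a = inv_root1 @ sqrtm_psd(root1 @ s2 @ root1) @ inv_root1
    return lambda x: m2 + (x - m1) @ a.T

m1, s1 = np.zeros(2), np.eye(2)
m2, s2 = np.array([3.0, 4.0]), 4.0 * np.eye(2)
w2_sq = gaussian_w2_squared(m1, s1, m2, s2)   # 25 + tr(I) = 27 up to rounding
transport = gaussian_monge_map(m1, s1, m2, s2)
mapped = transport(np.array([[0.0, 0.0], [1.0, 0.0]]))
```

The matrix square roots are the costly part in high dimension, which is presumably where a parsimonious covariance structure helps.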

7.1.6​​ A Model-Based Clustering Approach​​​‌ for Toxicity Assessment Using‌ Cell Painting Data

Participants:‌​‌ Mariam Grigoryan, Vincent​​ Vandewalle.

Collaborations: David​​​‌ Rouquié

Keywords: Cells, Clustering‌

Cell Painting is a‌​‌ high-content imaging approach that​​ enables the simultaneous analysis​​​‌ of multiple cellular compartments,‌ providing a comprehensive view‌​‌ of how chemical compounds​​ affect cell morphology and​​​‌ organelle organization. In this‌ work, we study these‌​‌ morphological changes in a​​ dose–response framework with the​​​‌ objective of identifying the‌ Point of Departure (POD),‌​‌ defined as the lowest​​ concentration at which significant​​​‌ morphological alterations are observed.‌

The dose–response problem is‌​‌ addressed using a two-step​​ methodology. First, cells within​​​‌ each well of the‌ assay plates are clustered‌​‌ using a Gaussian mixture​​ model. In this model,​​​‌ cluster-specific parameters are assumed‌ to be shared across‌​‌ wells, while cluster proportions​​ are allowed to vary​​​‌ from one well to‌ another. This formulation makes‌​‌ it possible to summarize​​ each well by a​​​‌ low-dimensional vector of class‌ proportions. In a second‌​‌ step, a sequential statistical​​ testing procedure is applied​​​‌ to identify the minimum‌ concentration at which a‌​‌ compound induces a significant​​ shift in cell characteristics​​​‌ relative to negative control‌ wells. This comparison is‌​‌ performed using a non-parametric​​ permutation test on the​​​‌ vectors of class proportions,‌ with tests conducted sequentially‌​‌ over increasing compound concentrations.​​
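The second step can be sketched as follows. This is only an illustration under stated assumptions: the test statistic (Euclidean distance between mean proportion vectors), the significance level, the stop-at-first-rejection rule and the toy proportions are all hypothetical choices, and no multiple-testing correction is applied.

```python
import numpy as np

def permutation_pvalue(control, treated, n_perm=2000, seed=0):
    """Permutation p-value for a shift in mean class proportions between two groups of wells."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)
    pooled = np.vstack([control, treated])
    n_control = len(control)

    def statistic(a, b):
        return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

    observed = statistic(control, treated)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        exceed += statistic(pooled[perm[:n_control]], pooled[perm[n_control:]]) >= observed
    return (exceed + 1) / (n_perm + 1)

def point_of_departure(control, treated_by_dose, alpha=0.05):
    """Lowest concentration whose wells differ significantly from controls, or None."""
    for dose in sorted(treated_by_dose):
        if permutation_pvalue(control, treated_by_dose[dose]) < alpha:
            return dose
    return None

# Toy class-proportion vectors, one row per well (hypothetical values).
control = np.array([[0.9, 0.1], [0.8, 0.2]] * 3)
low_dose = control.copy()                           # no shift at the low concentration
high_dose = np.array([[0.1, 0.9], [0.2, 0.8]] * 3)  # strong shift at the high concentration
pod = point_of_departure(control, {0.1: low_dose, 1.0: high_dose})
```

Working on class-proportion vectors keeps the test low-dimensional, which is the benefit of summarizing each well by its mixture proportions.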

The proposed approach is​​​‌ applied to Cell Painting‌ data and demonstrates its‌​‌ ability to detect distributional​​ shifts in cellular phenotypes​​​‌ across compound concentrations, thereby‌ providing a robust and‌​‌ interpretable framework for POD​​ identification.

This work was​​​‌ disseminated through a poster‌ presentation at the Model-Based‌​‌ Clustering Workshop (INRIA, July​​ 21–25), a poster presentation​​​‌ at EUROTOX (September 14–17,‌ Athens, Greece) (67‌​‌), and an oral​​ presentation at the Journées​​​‌ de Statistique (JDS, June‌ 2–6, Marseille) (47‌​‌).

7.1.7 Clustering of recurrent events

Participants: Vincent​​​‌ Vandewalle.

Collaborations: Génia‌ Babykina, Jésus Carretero-Bravo

Keywords:‌​‌ Clustering, Mixture models

Data are now often timestamped, so when analyzing events that may occur several times (recurrent events), it is desirable to model the whole dynamics of the counting process rather than to focus on the total number of events. Such data are encountered in hospital readmissions, disease recurrences or repeated failures of industrial systems. Recurrent events can be analyzed in the counting process framework, as in the Andersen-Gill model, assuming that the baseline intensity depends on time and on covariates, as in the Cox model. However, observed covariates are often insufficient to explain the observed heterogeneity in the data. We propose a mixture model for recurrent events that accounts for unobserved heterogeneity and performs clustering of individuals (unsupervised classification that partitions the heterogeneous data according to unobserved, or latent, variables). Within each cluster, the recurrent event process intensity is specified parametrically and is adjusted for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm; the Bayesian Information Criterion (BIC) is used to choose the number of clusters. The feasibility of the model is checked on simulated data. Real data on hospital readmissions of elderly people, which motivated the development of the proposed clustering model, are analyzed. The results allow a fine understanding of the recurrent event process in each cluster. This work has been published in 12 and also in a geriatrics journal 36.
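The EM and BIC steps can be illustrated on a deliberately simplified model: a mixture of constant-intensity event processes without covariates, where each individual contributes an event count over an exposure time. The model described above is richer (parametric time-dependent intensities adjusted for covariates); the function names and toy counts below are hypothetical.

```python
from math import lgamma
import numpy as np

def fit_poisson_mixture(counts, exposure, n_clusters, n_iter=200):
    """EM for a mixture of constant-intensity processes; returns (weights, rates, loglik)."""
    counts = np.asarray(counts, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    # Deterministic initialisation: split individuals sorted by empirical rate into chunks.
    order = np.argsort(counts / exposure)
    rates = np.array([counts[idx].sum() / exposure[idx].sum() + 1e-3
                      for idx in np.array_split(order, n_clusters)])
    weights = np.full(n_clusters, 1.0 / n_clusters)
    log_fact = np.array([lgamma(c + 1.0) for c in counts])
    for _ in range(n_iter):
        # E-step: log of weight_k * Poisson(count_i; rate_k * exposure_i), normalised stably.
        mean = rates[None, :] * exposure[:, None]
        log_joint = (np.log(weights)[None, :] + counts[:, None] * np.log(mean)
                     - mean - log_fact[:, None])
        shift = log_joint.max(axis=1, keepdims=True)
        log_norm = shift + np.log(np.exp(log_joint - shift).sum(axis=1, keepdims=True))
        resp = np.exp(log_joint - log_norm)
        # M-step: closed-form updates of the proportions and intensities.
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        rates = ((resp * counts[:, None]).sum(axis=0)
                 / ((resp * exposure[:, None]).sum(axis=0) + 1e-12) + 1e-12)
    # Log-likelihood evaluated at the parameters used in the last E-step.
    return weights, rates, float(log_norm.sum())

def bic(loglik, n_clusters, n_individuals):
    n_params = 2 * n_clusters - 1  # K intensities + (K - 1) free proportions
    return -2.0 * loglik + n_params * np.log(n_individuals)

counts = [1, 2, 1, 0, 2, 1, 20, 22, 19, 21, 18, 20]  # toy event counts
exposure = np.ones(len(counts))                      # same follow-up time for everyone
bics = {}
for k in (1, 2, 3):
    _, _, loglik = fit_poisson_mixture(counts, exposure, k)
    bics[k] = bic(loglik, k, len(counts))
best_k = min(bics, key=bics.get)
```

With this convention a lower BIC is better, so the selected number of clusters is the one minimizing the criterion.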

7.1.8 Unsupervised machine learning​​​‌ analysis to enhance risk​ stratification in patients with​‌ asymptomatic aortic stenosis

Participants:​​ Arnaud Droit, Pierre-Alexandre​​​‌ Mattei, Louis Ohl​, Frédéric Precioso.​‌

Collaborations: Marie-Ange Fleury, Philippe​​ Pibarot, and other researchers​​​‌ from Université Laval (Québec)​

Keywords: Clustering, Cardiology

There​‌ is a lack of​​ studies investigating the pathophysiologic​​​‌ and phenotypic distinctiveness of​ aortic stenosis (AS). This​‌ heterogeneity has important implications​​ for identifying optimal intervention​​​‌ timing and potential medical​ management. The study 19​‌ seeks to identify phenogroups​​ of AS using unsupervised​​​‌ machine learning to improve​ risk stratification. A total​‌ of 349 patients with​​ asymptomatic AS from the​​​‌ PROGRESSA study were included​ in this analysis. Echocardiographic,​‌ clinical and blood sample​​ data were used in​​​‌ the unsupervised clustering process.​ Longitudinal echocardiographic data were​‌ used to evaluate AS​​ progression. Five clusters of​​​‌ patients were revealed using​ 18 variables selected by​‌ an unsupervised machine learning​​ algorithm (see Fig. 3​​​‌). This approach may​ be useful to optimize​‌ and individualize medical and​​ interventional management of AS.​​​‌

Figure 3


Figure 3: Clustering​ of PROGRESSA patients.

7.1.9​‌ Non parametric multiple partition​​ clustering

Participants: Vincent Vandewalle​​​‌.

Collaborations: Marie Du​ Roy de Chaumaray, Matthieu​‌ Marbac-Lourdelle

Keywords: Clustering, Mixture​​ Models

In the framework of model-based clustering, a model called multi-partition clustering, which allows several latent class variables, has been proposed. This model assumes that the distribution of the observed data can be factorized into several independent blocks of variables, each block following its own mixture model. In this study, we assume that each block follows a non-parametric latent class model, i.e. the variables are independent within each component of the mixture, with no parametric assumption on their class-conditional distributions. The purpose is to deduce, from the observation of a sample, the number of blocks, the partition of the variables into the blocks and the number of components in each block, which characterize the proposed model. Following recent literature on model and variable selection in non-parametric mixture models, we propose to discretize the data into bins. This makes it possible to apply the classical multi-partition clustering procedure for parametric multinomials, which is based on a penalized likelihood method (e.g. BIC). The consistency of the procedure is established and an efficient optimization is proposed. The performance of the model is investigated on simulated data. This work has been presented in 39, and also in 40.

7.1.10​​ An EM Stopping Rule​​​‌ for Avoiding Degeneracy in‌ Gaussian-based Clustering with Missing‌​‌ Data

Participants: Vincent Vandewalle​​.

Collaborations: Christophe Biernacki​​​‌

Keywords: Expectation-Maximization, Clustering

The frequency of missing data increases with the growing size of modern multivariate datasets. In Gaussian model-based clustering, the EM algorithm easily takes such data into account, but the degeneracy problem is dramatically aggravated during the EM runs: parameter degeneracy is quite slow and also more frequent than with complete data. Consequently, degenerate parameter solutions may be confused with valuable parameter solutions and, in addition, computing time may be wasted on runs that end up degenerate. In this work, a simple and weakly informative condition on the latent partition leads to a very simple partition-based stopping rule for EM, which shows good behavior in numerical experiments. This work has been presented in 38.

7.1.11 Unveiling Hidden‌ Structures in the Main‌​‌ Belt: A Probabilistic Framework​​ for Asteroid Families

Participants:​​​‌ Maya Guy, Vincent‌ Vandewalle.

Keywords: Asteroids,‌​‌ Asteroid families, Probabilistic models,​​ Clustering

Collaborations: Benoit Carry​​​‌

The identification of asteroid families is a key question in planetary sciences, offering crucial insights into the collisional and dynamical history of the asteroid Main Belt (MB). These families, originating from the fragmentation of parent bodies due to catastrophic collisions, form dense clusters in orbital space. Over time, the non-gravitational Yarkovsky effect induces a semi-major axis drift, producing the characteristic V-shaped patterns in the (semi-major axis, absolute magnitude) plane. Current techniques for family identification suffer from several limitations. The most widely used approach, the Hierarchical Clustering Method (HCM), does not account for the presence of a background population, leading to overestimated family cores and the omission of their extended wings. Furthermore, no families older than 2 Gyr have been confidently identified using this method. Additionally, HCM assumes that families are non-overlapping in the proper element space, an unrealistic assumption as young families may overlap with older, more diffuse families. To overcome some of the limitations of HCM, the V-shape method was developed. Based on the imprint of Yarkovsky-induced spreading, it has successfully identified very old families. While recent combined approaches have incorporated the background population into family detection frameworks, they still lack an intrinsic mechanism for handling overlapping families and do not yield probabilistic membership lists.
In this study (68), we propose a new probabilistic approach for identifying asteroid families in the MB, using model-based clustering. We model the observed population of the MB as a mixture of skewed-t distributions for eccentricity, inclination, and absolute magnitude, coupled with a Gaussian distribution for semi-major axis that explicitly depends on absolute magnitude, which captures the Yarkovsky-driven semi-major axis evolution. The model parameters, which define the shape and orientation of each cluster along each dimension, and the mixture proportions are estimated using the Expectation-Maximization (EM) algorithm. The model also includes a uniform background component for the primordial asteroid population. This flexible approach accommodates anisotropic and overlapping family structures and provides probabilistic membership assignments, enabling a more nuanced and robust classification of asteroid families. We will present the methodology and results from simulated datasets to demonstrate the performance and advantages of this approach.

7.1.12 From Fragments​‌ to Families: Asteroid Clustering​​

Participants: Maya Guy,​​​‌ Vincent Vandewalle.

Keywords:​ Asteroids, Asteroid families, Probabilistic​‌ models, Clustering

Collaborations: Benoit​​ Carry

This study 69​​​‌ provides a new probabilistic​ approach for identifying asteroid​‌ families in the Main​​ Belt (MB), addressing limitations​​​‌ of traditional methods like​ the Hierarchical Clustering Method​‌ (HCM). These methods often​​ overestimate family cores, miss​​​‌ extended wings (halos), and​ struggle with identifying older​‌ families and overlapping structures.​​ The proposed model uses​​​‌ skewed-t distributions for eccentricity,​ inclination, and absolute magnitude,​‌ coupled with a Gaussian​​ distribution for semi-major axis​​​‌ evolution influenced by the​ Yarkovsky effect. This flexible​‌ approach accommodates anisotropic and​​ overlapping family structures, providing​​​‌ probabilistic membership assignments for​ a more nuanced classification.​‌

7.1.13 An in depth​​ look at the Procrustes-Wasserstein​​​‌ distance: properties and barycenters​

Participants: Davide Adamo,​‌ Marco Corneli.

Keywords: Optimal Transport, Procrustes Analysis, Point Cloud, Zooarchaeology

Collaborations:​ Manon Vuillien, Emmanuelle Vila​‌

Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and better suited to tasks such as the alignment and comparison of point clouds. With that application in mind, in 43, we carefully build a space of discrete probability measures and show that, over that space, PW is indeed a distance. Algorithms to solve the PW problem already exist; we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of PW barycenter and detail an algorithm to estimate it from the data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation (see Fig. 4). We finally show the usefulness of the PW barycenters in an archaeological context. Our results highlight the potential of PW in boosting 2D and 3D point cloud analysis for machine learning and computational geometry applications.
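One building block of such alignment problems is the orthogonal Procrustes step: given a coupling between two point clouds, the best orthogonal transformation has a closed form through a singular value decomposition. The sketch below shows that step only, under the assumption that the coupling is known; it is not the full PW solver or the barycenter algorithm of 43, and `procrustes_step` is a hypothetical name.

```python
import numpy as np

def procrustes_step(x, y, coupling):
    """Orthogonal P minimising sum_ij coupling[i, j] * ||x_i - P y_j||^2."""
    m = x.T @ coupling @ y          # (d, d) cross-covariance weighted by the coupling
    u, _, vt = np.linalg.svd(m)
    return u @ vt                   # may be a rotation or a reflection

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
y = x @ rotation.T                  # y_j = rotation @ x_j
coupling = np.eye(50) / 50          # known one-to-one correspondence (assumed for the demo)
p = procrustes_step(x, y, coupling)
aligned = y @ p.T                   # rows are P y_j
```

In a PW-type problem the coupling is itself unknown, so a step of this kind would typically alternate with an optimal transport update, which is where initialization matters.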

Figure 4

The​​​‌ image displays a grid​ of 3D models showing​‌ progressive changes from an​​ archaeological form to a​​ modern form. It consists​​​‌ of three rows labeled‌ row1, row2, and row3,‌​‌ and five columns labeled​​ from "archaeo" to "modern."​​​‌ Each model transitions through‌ different stages labeled with‌​‌ fractions (η = 1/5,​​ η = 2/5, η​​​‌ = 3/5, η =‌ 4/5), showing a smooth‌​‌ transformation in shape and​​ texture. The colors range​​​‌ from bright yellow to‌ dark purple.

Figure 4: PW barycenter evolution between two 3D point clouds describing an archaeological (leftmost) and a modern (rightmost) sheep astragalus. The four middle columns of the grid correspond to representative interpolations, each associated with a value of η. (row1) Progressive interpolation in Euclidean space; note that the two input point clouds are not aligned and no prior knowledge of pairwise correspondences is used. The $P^*$ solution of PW allows us to optimally display the frontal view (row2) and the top view (row3), in order to match reference manuals of morphological criteria in archaeology.

7.1.14 Rethinking multiple kernel​​ learning under the lenses​​​‌ of Importance Weighted Monte‌ Carlo Variational Inference

Participants:‌​‌ Davide Adamo, Marco​​ Corneli.

Keywords: Multiple​​​‌ kernel learning, Monte Carlo‌ variational inference, Kernel(s) selection,‌​‌ Importance-Weighted lower bound

Collaborations:​​ Manon Vuillien, Emmanuelle Vila​​​‌

Kernel methods have been widely used in machine learning as they are a powerful tool for implicitly mapping data into high-dimensional spaces, enabling the discovery of complex patterns that might be challenging to capture in the original feature space. Although some classification and regression problems can be successfully addressed with a single kernel, real-world scenarios sometimes exhibit complex structures, and it is then desirable to employ several kernel types, one for each notion of similarity to be taken into account. This is where multiple kernel learning (MKL) comes into play. The paper 60 revisits multi-kernel classification with a specific focus on kernel selection in the light of recent developments in Monte Carlo (importance weighted) variational inference (MCVI). In the framework of kernelized logistic regression (KLR), we consider positive semi-definite linear combinations of kernels and treat the kernel weights as random variables. Proper choices of prior distributions, coupled with the explicit derivation of the importance-weighted lower bound (IW-ELBO), a generalization of the traditional variational lower bound (ELBO), allow us both to perform kernel selection via shrinkage and to perform posterior inference on the kernel weights, without needing MCMC sampling. Unlike pure optimization-based approaches to MKL, our optimization problem does not require explicit constraints and can be solved by standard stochastic gradient descent (see Fig. 5).
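The relation between the ELBO and its importance-weighted generalization can be seen on a toy conjugate model where the marginal likelihood is known. The sketch below is not the kernelized logistic regression model of 60: the Gaussian model, the proposals and the function names are assumptions chosen so that the bounds can be checked by hand.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def iw_elbo(log_weights):
    """Log of the average importance weight, computed stably (log-mean-exp)."""
    shift = np.max(log_weights)
    return shift + np.log(np.mean(np.exp(log_weights - shift)))

x_obs = 1.5

def log_weights(z, q_mean, q_var):
    """log p(x, z) - log q(z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x_obs, z, 1.0)
    return log_joint - log_normal_pdf(z, q_mean, q_var)

rng = np.random.default_rng(0)
log_evidence = log_normal_pdf(x_obs, 0.0, 2.0)   # exact log p(x), since x ~ N(0, 2)

# Proposal q = prior: the importance-weighted bound is at least the plain ELBO estimate.
z_prior = rng.normal(0.0, 1.0, size=64)
lw_prior = log_weights(z_prior, 0.0, 1.0)
elbo_estimate = lw_prior.mean()
iw_estimate = iw_elbo(lw_prior)

# Proposal q = exact posterior N(x / 2, 1 / 2): every weight equals p(x), the bound is tight.
z_post = rng.normal(x_obs / 2.0, np.sqrt(0.5), size=64)
lw_post = log_weights(z_post, x_obs / 2.0, 0.5)
```

By Jensen's inequality the log of the averaged weights can never fall below the average of the log weights computed on the same samples, which is why the importance-weighted bound is at least as tight.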

Figure 5

This figure shows 10 plots of posterior densities for kernel weights in a 2x5 grid. In each plot, the blue and orange curves follow similar profiles, with the orange curve showing a sharper peak.

Figure 5: Posterior densities for the kernel weights $\beta_1, \ldots, \beta_{10}$ associated with 10 different kernel matrices. The shrinking effect is clearly visible for all the kernel weights except $\beta_3$, which indicates an informative kernel matrix.

7.1.15 The Deep​​ Zero-Inflated Latent Position Block​​​‌ Model for the Clustering​ of Nodes in Graphs​‌

Participants: Seydina Niang,​​ Charles Bouveyron, Marco​​​‌ Corneli.

Keywords: Nodes​ clustering, Graph variational autoencoder,​‌ Block modeling, Graph visualization,​​ Zero-inflated Poisson

Collaborations: Pierre​​​‌ Latouche

The evolution in​ storage capacities has led​‌ to a data explosion,​​ making networks essential for​​​‌ modeling relationships between objects​ (nodes). These complex networks​‌ require effective clustering and​​ visualization methods to summarize​​​‌ and interpret their information.​ The deep latent position​‌ block model (Deep-LPBM), designed​​ for binary networks, combines​​​‌ partial block-based clustering and​ continuous latent representation to​‌ visualize nodes. Here, we​​ propose an extension, the​​​‌ deep zero-inflated latent position​ block model (Deep-ZLPBM, 55​‌), designed for non-binary​​ networks, where the entries​​​‌ of the adjacency matrix​ can take integer values.​‌ This model is based​​ on a deep variational​​​‌ autoencoder that integrates a​ graph convolutional network (GCN)​‌ and a decoder leveraging​​ a zero-inflated Poisson (ZIP)​​​‌ distribution. Inference relies on​ the maximization of the​‌ marginal likelihood, and optimization​​ is performed using stochastic​​​‌ gradient descent.
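For reference, the zero-inflated Poisson distribution used by the decoder puts extra probability mass on zero on top of a Poisson law. The sketch below evaluates its log-probabilities with standard formulas; the parameter names `rate` and `zero_prob` are hypothetical and the way Deep-ZLPBM parametrizes them from the latent variables is not shown.

```python
from math import lgamma
import numpy as np

def zip_log_pmf(k, rate, zero_prob):
    """Log-pmf of a zero-inflated Poisson: extra mass zero_prob at 0, Poisson(rate) otherwise."""
    k = np.asarray(k)
    log_fact = np.array([lgamma(v + 1.0) for v in k.ravel()]).reshape(k.shape)
    log_poisson = k * np.log(rate) - rate - log_fact
    # A zero can come from the inflation component or from the Poisson component.
    log_zero = np.log(zero_prob + (1.0 - zero_prob) * np.exp(-rate))
    return np.where(k == 0, log_zero, np.log1p(-zero_prob) + log_poisson)

ks = np.arange(0, 60)
log_p = zip_log_pmf(ks, rate=3.0, zero_prob=0.2)
total_mass = np.exp(log_p).sum()   # close to 1, the tail beyond 59 is negligible
```

The inflation term lets the model represent sparse adjacency matrices whose non-zero entries are still counts.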

7.1.16 Importance​ weighted directed graph variational​‌ auto-encoder for block modeling​​ of complex networks

Participants:​​​‌ Seydina Niang, Charles​ Bouveyron, Marco Corneli​‌.

Keywords: Nodes clustering,​​ Graph variational autoencoder, Block​​​‌ modeling, Graph visualization, Zero-inflated​ Poisson

Collaborations: Pierre Latouche​‌

This work addresses the​​ fundamental challenges of jointly​​​‌ performing node clustering and​ representation learning in directed​‌ and valued graphs, which​​ need both global and​​​‌ local network structures to​ be captured. While these​‌ two tasks are highly​​ interdependent, they are often​​​‌ treated separately in existing​ works. We propose the​‌ deep zero-inflated latent position​​ block model (Deep-ZLPBM, 65​​​‌) in the context​ of directed and valued​‌ networks characterized by non-symmetric​​ adjacency matrices with positive​​​‌ integer entries. Our approach​ leverages a variational autoencoder​‌ (VAE) framework, combining a​​ directed graph neural network​​​‌ (DirGNN) encoder designed to​ handle directed edges and​‌ a zero-inflated Poisson (ZIP)​​ block modeling decoder to​​​‌ model sparse, integer-weighted interactions.​ Recognizing the limitations of​‌ the standard evidence lower​​ bound (ELBO) in VAEs,​​​‌ we explore the importance​ weighted ELBO (iw-ELBO), a​‌ tighter bound on the​​ marginal log-likelihood optimized via​​​‌ gradient ascent, to enhance​ inference. Extensive experiments on​‌ synthetic datasets demonstrate that​​ iw-ELBO optimization yields significant​​​‌ performance gains. Moreover, our​ results validate that Deep-ZLPBM​‌ effectively models complex network​​ structures, providing interpretable partial​​​‌ memberships and insightful visualizations​ for directed, valued graphs.​‌

7.2 Understanding (deep) learning​​ models

7.2.1 When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Participants: Raphael Razafindralambo, Rémy Sun, Damien Garreau, Frédéric Precioso, Pierre-Alexandre Mattei.

Keywords: Ensembles, Diffusion, Random forests

Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In 31 we investigate whether it provides tangible benefits for generative modeling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as Fréchet Inception Distance (FID) on image datasets. We confirm this observation across a breadth of aggregation rules, using Deep Ensembles and Monte Carlo Dropout, on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).

7.2.2 Are Ensembles Getting Better all the Time?

Participants: Damien Garreau, Pierre-Alexandre Mattei.

Keywords: Ensembles,​​​‌ Dropout, Random forests

Ensemble‌ methods combine the predictions‌​‌ of several base models.​​ We study whether including​​​‌ more models always improves‌ their average performance. This‌​‌ question depends on the​​ kind of ensemble considered,​​​‌ as well as the‌ predictive metric chosen. We‌​‌ focus on situations where​​ all members of the​​​‌ ensemble are a priori‌ expected to perform as‌​‌ well, which is the​​ case of several popular​​​‌ methods such as random‌ forests or deep ensembles.‌​‌ In this setting, we​​ show in 28 that​​​‌ ensembles are getting better‌ all the time if,‌​‌ and only if, the​​ considered loss function is​​​‌ convex. More precisely, in‌ that case, the average‌​‌ loss of the ensemble​​ is a decreasing function​​​‌ of the number of‌ models. When the loss‌​‌ function is nonconvex, we​​ show a series of​​​‌ results that can be‌ summarized as: ensembles of‌​‌ good models keep getting​​ better, and ensembles of​​​‌ bad models keep getting‌ worse. To this end,‌​‌ we prove a new​​ result on the monotonicity​​​‌ of tail probabilities that‌ may be of independent‌​‌ interest. We illustrate our​​ results on a medical​​​‌ prediction problem (diagnosing melanomas‌ using neural nets) and‌​‌ a "wisdom of crowds"​​ experiment (guessing the ratings​​​‌ of upcoming movies).
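The convex case can be checked numerically on a small pool of models. The sketch below averages the squared loss (a convex loss) of the averaged prediction over every sub-ensemble of a given size; the noisy toy "models" and the function names are assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def average_ensemble_loss(predictions, target, size, loss):
    """Mean loss of the averaged prediction over all ensembles of a given size."""
    losses = [loss(predictions[list(members)].mean(axis=0), target)
              for members in combinations(range(len(predictions)), size)]
    return float(np.mean(losses))

def squared_loss(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
target = rng.normal(size=20)
# Six toy "models": the target corrupted by independent noise.
predictions = target + rng.normal(scale=1.0, size=(6, 20))
curve = [average_ensemble_loss(predictions, target, k, squared_loss)
         for k in range(1, 7)]
```

Here `curve` is non-increasing in the ensemble size: the prediction of an ensemble of size k+1 is the average of the predictions of its sub-ensembles of size k, so Jensen's inequality applies to any convex loss.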

7.2.3 Re-examining Concept-based Explainable Models for Multimodal Interpretative Tasks

Participants: Julie Tores,​​​‌ Elisa Ancarani, Rémy‌ Sun, Frédéric Precioso‌​‌.

Keywords: Deep Learning,​​ Multimedia, Concept based model,​​​‌ Objectification

Collaborations: Lucile Sassatelli,‌ Hui-Yin Wu

Concept-based models have been proposed as a new line of research for explainable-by-design deep learning models. However, those models reach their full potential only on benchmarks where the concepts are well defined and their attributes are easily extractable from the raw data. In 52, we challenge the most recent concept-based model, initially developed for image classification, on more complex interpretative tasks from a recently proposed video benchmark, where such models perform poorly. We conduct a root cause analysis of the poor performance of state-of-the-art explainable concept-based models on these multimodal interpretative tasks, and propose adaptations to design robust explainable models for detecting character objectification in this novel, challenging video benchmark. We show that the optimal architectural choice may vary depending on the modality setting, which indicates that designing multimodal concept-based approaches remains an open challenge and calls for further investigation.

7.2.4 Normative Alignment of​​ Recommender Systems via Internal​​​‌ Label Shift

Participants: Pierre-Alexandre​ Mattei.

Keywords: Deep​‌ Learning, Recommender systems, Fairness,​​ Label Shift, Alignment

Collaborations:​​​‌ Johannes Kruse, Kasper Lindskow,​ Michael Riis Andersen, Ryotaro​‌ Shimizu, Julian McAuley, Jes​​ Frellsen

Recommender systems optimized​​​‌ solely for user engagement​ often fail to meet​‌ broader normative objectives such​​ as fairness, diversity, or​​​‌ editorial values. In 49​, we introduce NAILS​‌ (Normative Alignment of recommender​​ systems via Internal Label​​​‌ Shift), a simple and​ scalable method for aligning​‌ recommendation outputs with target​​ distributions over item-level attributes​​​‌ (e.g., categories). NAILS modifies​ the user-conditional item distribution​‌ to induce a specified​​ marginal distribution over attributes,​​​‌ leveraging existing user–item preferences​ without retraining the model.​‌ To achieve this, we​​ recast the problem as​​​‌ a form of label​ shift applied internally within​‌ a hierarchical classification framework.​​ Adopting a stakeholder-centric perspective,​​​‌ NAILS enables alignment with​ global normative goals. Empirically,​‌ we show that NAILS​​ consistently improves attribute-level alignment​​​‌ with minimal impact on​ user engagement, providing a​‌ practical mechanism for value-driven​​ recommendation.
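The label-shift idea can be illustrated with a minimal sketch. This is a generic one-step label-shift reweighting under toy assumptions (each item carries exactly one attribute, and every attribute currently receives non-zero mass); it is not the NAILS implementation from 49, and all function names and numbers are hypothetical. Each item probability is rescaled by the ratio between the target and current attribute marginals, then renormalized per user, so the model itself is never retrained.

```python
def attribute_marginal(item_probs_per_user, item_attr, n_attrs):
    """Average, over users, of the probability mass each attribute receives."""
    marginal = [0.0] * n_attrs
    n_users = len(item_probs_per_user)
    for probs in item_probs_per_user:
        for p, attr in zip(probs, item_attr):
            marginal[attr] += p / n_users
    return marginal


def label_shift_adjust(item_probs_per_user, item_attr, target):
    """Reweight p(item | user) by target(attr) / current(attr), then renormalize.

    Within an attribute, the relative preferences of a user are unchanged.
    Assumes every attribute has positive current mass (no division by zero).
    """
    current = attribute_marginal(item_probs_per_user, item_attr, len(target))
    weights = [t / c for t, c in zip(target, current)]
    adjusted = []
    for probs in item_probs_per_user:
        scaled = [p * weights[attr] for p, attr in zip(probs, item_attr)]
        norm = sum(scaled)
        adjusted.append([s / norm for s in scaled])
    return adjusted


# Hypothetical data: three items, the first two in attribute 0, the last in attribute 1.
item_attr = [0, 0, 1]
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
target = [0.5, 0.5]

before = attribute_marginal(probs, item_attr, 2)
new_probs = label_shift_adjust(probs, item_attr, target)
after = attribute_marginal(new_probs, item_attr, 2)
print(before, after)
```

A single reweighting step moves the attribute marginal towards the target but does not, in general, match it exactly; iterating the step or solving for the weights would be needed for an exact match.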

7.3 Adaptive and​​​‌ robust learning

7.3.1 Mind​ the map! Accounting for​‌ existing map information when​​ estimating online HDMaps from​​​‌ sensor data

Participants: Rémy​ Sun, Li Yang​‌, Diane Lingrand,​​ Frédéric Precioso.

Keywords:​​​‌ Autonomous Driving, HDMaps, Online​ HDMap estimation

Collaborations: ANR​‌ Project MultiTrans

Online High Definition Map (HDMap) estimation from sensors offers a low-cost alternative to manually acquired HDMaps. As such, it promises to reduce costs for Autonomous Driving systems that already rely on HDMaps, and potentially even to spread their use to new systems. We propose in 51 to improve online HDMap estimation by accounting for already existing maps. We identify three reasonable types of useful existing maps (minimalist, noisy, and outdated). We also introduce MapEX (see Fig. 6), a novel online HDMap estimation framework that accounts for existing maps. MapEX achieves this by encoding map elements into query tokens and by refining the matching algorithm used to train classic query-based map estimation models. We demonstrate that MapEX brings significant improvements on the nuScenes dataset. For instance, given noisy maps, MapEX improves by 38% over the MapTRv2 detector it is based on and by 16% over the current state of the art (SOTA).

Figure 6

This figure shows input images being put through a gray shape called BEV encoder to output a grid of features. In parallel, a map is transformed into colored square features. Both types of features are put through a gray shape called decoder to output a row of colored boxes that get decoded into map elements.

Figure 6: Overview of our MapEX method (see Sec. 7.3.1). We add two modules (EX query encoding, Attribution) to the standard query-based map estimation pipeline (in gray on the figure). Map elements are encoded into EX queries, then decoded with a standard decoder.

7.3.2​​​‌ Efficiency in the classification‌ of chest X-ray images‌​‌ through generative parallelization of​​ the Neural Architecture Search​​​‌

Participants: Michel Riveill.‌

Keywords: Deep Learning, Neural‌​‌ Architecture Search, X-ray

Collaborations: Felix Mejía Cajicá, John Anderson García Henao, Carlos Jaime Barrios Hernández

29 explores Generic Neural Architecture Search (GenNAS) for chest X-ray classification of lung diseases, leveraging novel parallel training methods for enhanced accuracy and efficiency. Medical image classification for pulmonary pathologies from chest X-rays is traditionally time-consuming. GenNAS, using GPT-4's generative capabilities, automates optimal architecture learning from data. This study investigates parallelization and generative algorithms to optimize neural network architectures for chest X-ray classification, and analyzes their impact on the NAS algorithm using the CheXpert dataset, which contains 224,316 chest X-rays, to classify five lung disease pathologies. GenNASXRays evaluates 6561 architecture possibilities in an 8-layer search space, with AUC-ROC and Precision-Recall plots as metrics. Training on 187,641 images, the sequential algorithm took 190.2 hours for an AUC-ROC of 0.869. In parallel execution on two GPUs, an AUC-ROC of 0.87 was achieved in 127.09 hours, highlighting the efficiency of parallelization.
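The reported timings can be turned into standard parallel-computing figures. The short computation below uses only the numbers quoted above; the reading of 6561 as three options per layer is an inference from 6561 = 3^8 and is not stated in the text.

```python
SEQUENTIAL_HOURS = 190.2  # reported sequential training time
PARALLEL_HOURS = 127.09   # reported training time on two GPUs
N_GPUS = 2

# Speedup: how many times faster the parallel run is.
speedup = SEQUENTIAL_HOURS / PARALLEL_HOURS
# Parallel efficiency: speedup divided by the number of devices (1.0 is ideal).
efficiency = speedup / N_GPUS
# Absolute and relative wall-clock savings.
hours_saved = SEQUENTIAL_HOURS - PARALLEL_HOURS
fraction_saved = hours_saved / SEQUENTIAL_HOURS

# Assumption: an 8-layer search space with the same number of options per layer.
choices_per_layer = round(6561 ** (1 / 8))

print(f"speedup: {speedup:.3f}")
print(f"efficiency: {efficiency:.3f}")
print(f"hours saved: {hours_saved:.2f} ({fraction_saved:.1%})")
print(f"choices per layer: {choices_per_layer}")
```

With two GPUs the speedup is roughly 1.5 rather than the ideal 2, which corresponds to a parallel efficiency of about 75%.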

7.3.3 Parsimonious Gaussian‌​‌ mixture models with piecewise-constant​​ eigenvalue profiles

Participants: Pierre-Alexandre​​​‌ Mattei, Charles Bouveyron‌.

Keywords: GMM, Low-rank,‌​‌ Clustering, Denoising

Collaborations: Tom​​ Szwagier, Xavier Pennec

Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the over-parameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) lack the flexibility to fit certain anisotropic distributions. Connecting these two extremes, we introduce in 33 a new family of parsimonious GMMs with piecewise-constant covariance eigenvalue profiles. These extend several low-rank models like the celebrated mixtures of probabilistic principal component analyzers (MPPCA), by enabling any possible sequence of eigenvalue multiplicities. If the latter are pre-specified, then we can naturally derive an expectation-maximization (EM) algorithm to learn the mixture parameters. Otherwise, to address the notoriously challenging issue of jointly learning the mixture parameters and hyperparameters, we propose a component-wise penalized EM algorithm, whose monotonicity is proven. We show the superior likelihood-parsimony tradeoffs achieved by our models on a variety of unsupervised experiments: density fitting, clustering and single-image denoising.
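A piecewise-constant eigenvalue profile can be written down explicitly. The notation below ($p$, $d$, $\gamma_k$, $\lambda_k$, $Q_k$) is introduced here for illustration and may differ from that of 33. For one mixture component in dimension $p$, with $d$ distinct eigenvalues of multiplicities $\gamma_1, \dots, \gamma_d$:

```latex
\Sigma \;=\; \sum_{k=1}^{d} \lambda_k \, Q_k Q_k^{\top},
\qquad
\lambda_1 > \lambda_2 > \dots > \lambda_d > 0,
\qquad
Q_k \in \mathbb{R}^{p \times \gamma_k},
\qquad
\sum_{k=1}^{d} \gamma_k = p,
```

where $[Q_1 \,\cdots\, Q_d]$ is an orthogonal $p \times p$ matrix. Under this parameterization, the two extremes mentioned above correspond to $\gamma = (p)$ for a spherical covariance and $\gamma = (1, \dots, 1)$ for a full covariance with distinct eigenvalues, while a probabilistic PCA covariance with a $q$-dimensional latent space corresponds to $\gamma = (1, \dots, 1, p - q)$.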

7.3.4 A Layer Selection‌​‌ Approach to Test Time​​ Adaptation

Participants: Sabyasachi Sahoo​​​‌, Jonas Ngawe,‌ Frédéric Precioso.

Keywords:‌​‌ Deep learning, Test Time​​ Adaptation, Distribution shift

Collaborations:​​​‌ Mostafa Elaraby, Yann Batiste‌ Pequignot, Christian Gagné

Test‌​‌ Time Adaptation (TTA) addresses​​ the problem of distribution​​​‌ shift by adapting a‌ pretrained model to a‌​‌ new domain during inference.​​ When faced with challenging​​​‌ shifts, most methods collapse‌ and perform worse than‌​‌ the original pretrained model.​​ In 50, we​​​‌ find that not all‌ layers are equally receptive‌​‌ to the adaptation, and​​ the layers with the​​​‌ most misaligned gradients often‌ cause performance degradation. To‌​‌ address this, we propose​​ GALA, a novel layer​​​‌ selection criterion to identify‌ the most beneficial updates‌​‌ to perform during test​​​‌ time adaptation. This criterion​ can also filter out​‌ unreliable samples with noisy​​ gradients. Its simplicity allows​​​‌ seamless integration with existing​ TTA loss functions, thereby​‌ preventing degradation and focusing​​ adaptation on the most​​​‌ trainable layers. This approach​ also helps to regularize​‌ adaptation to preserve the​​ pretrained features, which are​​​‌ crucial for handling unseen​ domains. Through extensive experiments,​‌ we demonstrate that the​​ proposed layer selection framework​​​‌ improves the performance of​ existing TTA approaches across​‌ multiple datasets, domain shifts,​​ model architectures, and TTA​​​‌ losses.
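To make the notion of gradient alignment concrete, here is a minimal sketch of one plausible layer-selection rule. It is an assumption made for illustration, not the GALA criterion itself, whose exact definition is given in 50: a layer is updated only if its proposed gradient-descent step points in the same general direction (positive cosine similarity) as the updates it has accumulated so far. All names and values are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def select_layers(grads, past_updates, threshold=0.0):
    """Return the layers whose proposed step agrees with their past updates.

    grads:        layer name -> flattened gradient of the TTA loss
    past_updates: layer name -> flattened accumulated parameter change
    """
    selected = []
    for name, grad in grads.items():
        step = [-g for g in grad]  # gradient descent moves against the gradient
        if cosine(step, past_updates[name]) > threshold:
            selected.append(name)
    return selected


# Toy values: layer1 keeps moving in the same direction, layer2 reverses.
grads = {"layer1": [1.0, 0.0], "layer2": [-1.0, 0.5]}
past_updates = {"layer1": [-0.5, 0.1], "layer2": [-0.2, 0.0]}
layers_to_update = select_layers(grads, past_updates)
print(layers_to_update)
```

Because the rule only decides which layers receive an update, it can in principle be combined with any TTA loss that produces per-layer gradients.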

7.4 Learning with​ heterogeneous and corrupted data​‌

7.4.1 Towards a fully​​ automated underwater census for​​​‌ fish assemblages in the​ Mediterranean Sea

Participants: Kilian​‌ Bürgi, Charles Bouveyron​​, Diane Lingrand,​​​‌ Frédéric Precioso.

Keywords:​ Diver operated video, Automated​‌ UVC, Deep learning, Object​​ detection, Marine biology, Marine​​​‌ protected areas

Collaborations: Cécile Sabourault, Benoit Derijard

In marine biology and ecology, data collection relies mostly on diver operations, which are labor-intensive and costly, leading to less frequent data collection missions. Technological advances in the past decade have made it possible for unmanned robots to collect data, which has resulted in large volumes of video that need to be manually evaluated and analyzed, creating a new bottleneck. In this study 15, we explored the possibilities of, and differences between, manual and automated analysis of videos collected by divers simulating a remotely operated vehicle (ROV). We discuss the difference between collecting data by diver and by video and found that both methods added species to the overall species pool and that the automation was successful. This proof of concept will be used in future studies to speed up data analysis and allow more frequent data collection, creating more robust data for ecological decision making.

7.4.2 Automated​ Counting of Fish in​‌ Diver Operated Videos (DOV)​​ for Biodiversity Assessments

Participants:​​​‌ Kilian Bürgi, Charles​ Bouveyron, Diane Lingrand​‌, Frédéric Precioso.​​

Keywords: Underwater video, Fish​​​‌ count prediction, Temporal convolutional​ network, Object detection, Marine​‌ biology, Marine conservation

Collaborations: Cécile Sabourault, Benoit Derijard

Counting fish in moving underwater videos relies on labor-intensive manual counting or imprecise metrics from stationary cameras, although better methods have great potential to yield more accurate fish counts. For this purpose, we explored traditional methods of counting fish and introduced three new methods to count fish from computer-vision-derived data (single-frame detections). This resulted in a holistic and fully automated pipeline for fish abundance extraction 63. The following methods are applied to transect data of three Mediterranean species with different ecological niches: 1) the traditional N_max, 2) a 1D k-means clustering method, 3) an intuitive clustering approach, N_Heuristic, and 4) a Temporal Convolutional Network (TCN) counting method. Our results show evidence of underestimation by the traditional N_max, while the other methods showed better overall results, with the proposed N_Heuristic and TCN methods best representing reality. With an absolute variation comparable to inter-observer variation, we demonstrated reliable methods for quantifying fish counts.
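As a point of reference, the traditional N_max metric is usually defined as the largest number of individuals of a species visible in any single frame. The sketch below follows that common definition; the function name and the per-frame counts are hypothetical, and the other three methods of 63 are not reproduced here.

```python
def n_max(detections_per_frame):
    """Largest number of individuals detected in any single frame (0 if no frames)."""
    return max(detections_per_frame, default=0)


# Hypothetical single-frame detection counts for one species along a transect.
# Two separate groups pass in front of the moving camera (frames 1-4 and 7-9).
counts = [0, 1, 2, 2, 1, 0, 0, 1, 3, 1, 0]
print(n_max(counts))
```

If the two groups in this toy sequence are made of different individuals, at least five fish were encountered while N_max reports three, which illustrates how the metric can underestimate abundance on moving transects.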

7.4.3‌​‌ Topological data analysis and​​ multiple kernel learning for​​​‌ species identification of modern‌ and archaeological small ruminants‌​‌

Participants: Marco Corneli,​​ Davide Adamo.

Keywords:​​​‌ Shape Analysis, Point Clouds,‌ Classification, Multiple Kernel Learning,‌​‌ Topological Data Analysis

Collaborations:​​ Manon Vuillien, Emmanuelle Vila,​​​‌ Agraw Amane, Thierry Argant,‌ et al

The faunal‌​‌ remains from numerous Holocene​​ archaeological sites across southwest​​​‌ Asia frequently include the‌ bones of several wild‌​‌ and domestic ungulates, such​​ as sheep, goats, ibexes,​​​‌ roe deer and gazelles.‌ These assemblages may provide‌​‌ insight into hunting and​​ animal husbandry strategies and​​​‌ offer paleoecological information on‌ ancient human societies. However,‌​‌ the skeletons of these​​ taxa are highly similar​​​‌ in appearance, which presents‌ a challenge for accurate‌​‌ identification based on their​​ bones. This paper 37​​​‌ presents a case study‌ to test the potential‌​‌ of topological data analysis​​ (TDA) and multiple kernel​​​‌ learning (MKL) for inter-specific‌ identification of 150 3D‌​‌ astragali belonging to modern​​ and archaeological specimens. The​​​‌ joint application of TDA‌ and MKL demonstrated remarkable‌​‌ efficacy in accurately identifying​​ wild species, with a​​​‌ correct identification rate of‌ approximately 90%. In contrast,‌​‌ the identification of domestic​​ species exhibited a lower​​​‌ success rate, at approximately‌ 60%. The misidentification of‌​‌ sheep and goat species​​ is attributed to the​​​‌ morphological variability of domestic‌ breeds. Moreover, while these‌​‌ methods assist in clearly​​ identifying wild taxa from​​​‌ one another, they also‌ highlight their morphological diversity.‌​‌ In this context, TDA​​ and MKL could be​​​‌ invaluable for investigating intra-specific‌ variability in domestic and‌​‌ wild animals. These methods​​ offer a means of​​​‌ expanding our understanding of‌ past domestic animal selection‌​‌ practices and techniques. They​​ also facilitate an investigation​​​‌ into the morphological evolution‌ of wild animal populations‌​‌ over time (see Fig.​​ 7).

Figure 7

The figure shows two bone point clouds along with a persistence plot of deaths vs births of topological structures. Most points lie on the same diagonal.

Figure 7: 3D data analysis pipeline via Topological Data Analysis (TDA). (a) Input 3D scan; (b) Point cloud; (c) Persistence diagram (PD).

7.4.4 A‌ Bayesian approach for clustering‌​‌ and exact finite-sample model​​ selection in longitudinal data​​​‌ mixtures

Participants: Marco Corneli‌.

Keywords: Clustering, Longitudinal‌​‌ Data, Mixture Models, Bayesian​​ Model Selection

Collaborations: Elena​​​‌ Erosheva, Marco Lorenzi, Xunlei‌ Qian

In 16, we consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives such as the Bayesian information criterion or the integrated classification likelihood criterion in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real-world dataset from the Alzheimer's Disease Neuroimaging Initiative database.
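For context, the asymptotic alternative named above can be contrasted with the quantity it approximates. The notation is introduced here for illustration ($y$ for the data, $K$ for the number of clusters, $\nu_K$ for the number of free parameters, $n$ for the sample size) and is not taken from 16:

```latex
\mathrm{BIC}(K) \;=\; \log p\big(y \mid \hat{\theta}_K, K\big) \;-\; \frac{\nu_K}{2} \log n,
\qquad
\log p(y \mid K) \;=\; \log \int p(y \mid \theta, K)\, p(\theta \mid K)\, \mathrm{d}\theta .
```

The Bayesian information criterion is a large-sample approximation of the log marginal likelihood on the right. A finite-sample criterion instead relies on an integrated likelihood that can be evaluated exactly for the chosen priors; the specific criterion and priors are those defined in 16.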

7.4.5 Leveraging​‌ Concept Annotations for Trustworthy​​ Multimodal Video Interpretation through​​​‌ Modality Specialization

Participants: Elisa​ Ancarani, Julie Tores​‌, Rémy Sun,​​ Frédéric Precioso.

Keywords:​​​‌ Deep Learning, Multimedia, Scene​ understanding, Concepts, Multimodality

Collaborations: Lucile Sassatelli, Hui-Yin Wu

Multimodal datasets usually come as multimodal data annotated for a certain construct (such as depression). However, for such video interpretation tasks, models must not only make accurate predictions, but also make them for the right reasons. Ensuring model trustworthiness is, however, hampered by the lack of per-modality information. We consider in 44 the case of a recently introduced dataset for the video interpretation task of detecting objectification, annotated for this end task along with multimodal explanatory concepts that provide per-modality labels (see Fig. 8). With such additional knowledge, we study how to design models with both high task accuracy and modality trustworthiness. We first introduce the MSpecF framework, which articulates and fuses a spectrum of variably specialized models, and two trustworthiness metrics. We show that modality-specialized models generally maximize trustworthiness, and maximize task accuracy for confident modalities. For less certain modalities, task accuracy is maximized by non-specialized models. We show that the full fusion of specialized models MSpecF(All*) achieves advantageous trade-offs between task accuracy and trustworthiness compared to other fusion choices. This work shows that rich per-modality annotations of moderate-size datasets make it possible to build more trustworthy models, which is essential for applications such as supporting social scientists in analyzing complex social constructs.

Figure 8

The image depicts a​‌ scene from a video​​ with four frames showing​​​‌ a person sitting and​ then standing up, alongside​‌ transcribed speech and a​​ checklist of elements such​​​‌ as type of shot,​ look, emotion, activities, and​‌ more. There's an analysis​​ of different models (Visual,​​​‌ VT, and Text) and​ their predictions, with corresponding​‌ graphs showing task accuracy​​ versus untrustworthiness. The visual​​​‌ model seems to be​ more accurate compared to​‌ the VT and Text​​ models.

Figure 8: Combining input modalities can sometimes lead to wrong decisions. In this movie clip, a visual-only model correctly identifies the objectifying dimension, whereas a model combining text and visual modalities fails to do so. Concept annotations reveal that objectification occurs only in the visual modality; indiscriminately adding text caused the error. This paper shows that proper combinations of models specialized for different modalities yield the best trade-off between task performance and trustworthiness, as depicted in the right-hand plot.

7.4.6 CleverFish: An AI-driven​​​‌ Platform to Monitor and‌ Explore Marine Ecological Resources.‌​‌

Participants: Kilian Bürgi, Diane Lingrand, Rémy Sun, Charles Bouveyron.

Keywords: Deep Learning, Ecology, Object Detection, Counting

Collaborations: Stéphane Petiot, Cécile​​​‌ Sabourault, Benoit Derijard

The crucial need for reliable, robust and unbiased biodiversity data in support of initiatives such as the 30x30 initiative, which aims to conserve 30% of the world's oceans by 2030, presents significant scientific and technological challenges. Advances have been made in automating fish biodiversity assessments using computer vision. However, the stark difference in research fields between ecology and artificial intelligence hinders the efficient use of computer vision tools for ecological tasks. 45 presents CleverFish, a novel tool designed to bridge the gap between artificial intelligence and marine biology (see Fig. 9). CleverFish tackles three core challenges of an efficient management tool: i) providing an easy-to-use graphical user interface to an AI pipeline, ii) accommodating global and video-specific in-app biodiversity assessment and iii) allowing fast and efficient extraction of temporal and spatial fish species distribution in a format understandable for ecologists. An accessible web application enables seamless integration into marine monitoring pipelines and conservation efforts.

Figure 9

The image illustrates a‌ process for analyzing underwater‌​‌ biodiversity from video recordings.​​ Users upload videos of​​​‌ marine environments. Software identifies‌ and labels species within‌​‌ the videos, providing a​​ frequency of occurrence graph.​​​‌ Results can be exported‌ to CSV for further‌​‌ biodiversity assessments. The workflow​​ involves uploading, analyzing, and​​​‌ exporting the data.

Figure‌ 9: Overview of‌​‌ the complete CleverFish pipeline​​ from upload to export.​​​‌

7.4.7 WolBanking77: Wolof Banking‌ Speech Intent Classification Dataset.‌​‌

Participants: Abdou Karim Kandji​​, Frédéric Precioso.​​​‌

Keywords: Deep Learning, Natural‌ Language Processing, Dataset, Wolof‌​‌

Collaborations: Cheikh Ba, Samba​​ Ndiaye, Augustin Ndione

Intent classification models have made significant progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population, while the national illiteracy rate remains at 42%. Wolof is spoken by more than 10 million people in the West African region. To address these limitations, we introduce in 48 the Wolof Banking Speech Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including state-of-the-art text and voice models. The results are very promising on this current dataset. In addition, this paper presents an in-depth examination of the dataset's contents. We report baseline F1-scores for Natural Language Processing (NLP) models and word error rates for Automatic Speech Recognition (ASR) models trained on the WolBanking77 dataset, together with comparisons between models.
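For readers less familiar with the two metrics, the sketch below shows how they are commonly computed. It is a generic illustration, not the evaluation code of 48: the word error rate is the word-level edit distance divided by the reference length, and the F1-score is shown for a single class (multi-class intent classification typically averages it over classes). The example sentences are English placeholders, not dataset content.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def f1_score(true_positives, false_positives, false_negatives):
    """F1-score of one class from its confusion counts (0.0 if undefined)."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


wer = word_error_rate("check my account balance", "check account balance please")
f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(wer, f1)
```

A lower word error rate is better for ASR, whereas a higher F1-score is better for intent classification.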

7.4.8 Incorporating​ FATES Principles in Continuous​‌ Development of ML-Integrated Systems:​​ Importance of Requirements.

Participants:​​​‌ Nicolas Lacroix, Frédéric​ Precioso.

Keywords: Deep​‌ Learning, MLOps, FATES

Collaborations:​​ Jean-Michel Bruel, Tristan Gouaichault,​​​‌ Olivier Teste, Mireille Blay-Fornarino,​ Sébastien Mosser

The MLOps​‌ movement adopts the DevOps​​ objective of reducing the​​​‌ gap between development and​ operations teams by integrating​‌ data scientist teams and​​ Machine Learning (ML) models.​​​‌ In 53, the​ initial FATES-MLOps project aims​‌ to apply and adapt​​ good software engineering practices​​​‌ to enhance both the​ quality of the ML​‌ model construction processes and​​ the software systems produced,​​​‌ particularly with regard to​ extra-functional properties that will​‌ be critical issues: Fairness,​​ Accountability, Transparency, Ethics, and​​​‌ Security (FATES). This paper​ focuses specifically on the​‌ formalization, measurement, and management​​ of these properties throughout​​​‌ the MLOps process.

7.4.9​ Leveraging multimodal explanatory annotations​‌ for video interpretation with​​ Modality Specific Dataset

Participants:​​​‌ Elisa Ancarani, Julie​ Tores, Rémy Sun​‌, Frédéric Precioso.​​

Keywords: Deep Learning, Multimedia,​​​‌ Scene understanding, Concepts, Multimodality​

Collaborations: Lucile Sassatelli, Hui-Yin Wu

In 61,​​ we examine the impact​​​‌ of concept-informed supervision on​ multimodal video interpretation models​‌ using MOByGaze, a dataset​​ containing human-annotated explanatory concepts.​​​‌ We introduce Concept Modality​ Specific Datasets (CMSDs), which​‌ consist of data subsets​​ categorized by the modality​​​‌ (visual, textual, or audio)​ of annotated concepts. Models​‌ trained on CMSDs outperform​​ those using traditional legacy​​​‌ training in both early​ and late fusion approaches.​‌ Notably, this approach enables​​ late fusion models to​​​‌ achieve performance close to​ that of early fusion​‌ models. These findings underscore​​ the importance of modality-specific​​​‌ annotations in developing robust,​ self-explainable video models and​‌ contribute to advancing interpretable​​ multimodal learning in complex​​​‌ video analysis.

7.4.10 Using​ Small Language Models to​‌ Reverse-Engineer Machine Learning Pipelines​​ Structures

Participants: Nicolas Lacroix​​​‌, Frédéric Precioso.​

Keywords: Deep Learning, MLOps,​‌ FATES

Collaborations: Mireille Blay-Fornarino,​​ Sébastien Mosser

Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. In 70, we evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. We conduct a confirmatory study based on two reference works selected for their relevance to the limitations of the current state of the art. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis is conducted using Pearson's chi-squared tests to compare our insights on data science practices with those from prior studies.
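As an illustration of the paired comparison mentioned above, McNemar's test only looks at the items on which two classifiers disagree. The sketch below implements the textbook chi-squared version without continuity correction, using only the standard library; it is not the analysis script of 70, and the counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_test(b, c):
    """McNemar's chi-squared test (1 degree of freedom, no continuity correction).

    b: items classifier A labels correctly and classifier B labels incorrectly
    c: items classifier B labels correctly and classifier A labels incorrectly
    Returns (statistic, p_value).
    """
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical disagreement counts between an SLM and a reference classifier.
statistic, p_value = mcnemar_test(b=15, c=5)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

For small disagreement counts, an exact binomial version of the test is usually preferred over this chi-squared approximation.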

8 Bilateral contracts​​ and grants with industry​​​‌

8.1 Bilateral contracts with‌ industry

The team is‌​‌ particularly active in the​​ development of research contracts​​​‌ with private companies. The‌ following contracts were active‌​‌ during 2025:

  • Participants: Pierre-Alexandre​​ Mattei.

    Naval Group: The goal of this project was the development of an open-source Python library for semi-supervised learning, via the hiring of a research engineer, Lucas Boiteau. External participants: Alexandre Gensse, Quentin Oliveau (Naval Group). Amount: 125k€. The contract ended in March 2025.

  • Participants: Pierre-Alexandre Mattei​​.

    Pulse Audition: This​​​‌ contract was the fruit‌ of the "start it‌​‌ up" program of the​​ 3IA Côte d'Azur. The​​​‌ goal is to work‌ on semi-supervised learning for‌​‌ hearing glasses. A research​​ engineer (Léonie Borne) was​​​‌ recruited via the "start‌ it up" program. Amount:‌​‌ 15 000€. The contract​​ ended in August 2025.​​​‌

8.2 Bilateral grants with‌ industry

The team is‌​‌ also active in the​​ development of research projects​​​‌ with private companies.

  • France‌ 2030 Project, « Accélération‌​‌ des usages de l’IA​​ générative », BPI Grant​​​‌

    Participants: Frederic Precioso,‌ Remy Sun, Diane‌​‌ Lingrand.

    The project​​ Logie IA aims both​​​‌ (i) at creating a‌ value chain for interactive‌​‌ generative AI in social​​ logistics robotics and beyond,​​​‌ demonstrated by an optimized‌ Large Language Model (LLM)‌​‌ on a robot and​​ a dedicated processor with​​​‌ advanced voice control, and‌ (ii) at creating an‌​‌ open source sandbox that​​ complies with European laws​​​‌ to help build an‌ open source community. The‌​‌ consortium is composed of​​ Enchanted Tools (Leader), Inria​​​‌ Maasai, Sorbonne Université ISIR,‌ Avignon Université, NXP. Hugging‌​‌ Face is a non-funded​​ partner of the consortium.​​​‌ Grant amount: 3.7M€ overall,‌ 378k€ for Maasai.

  • BPI Grant for the project CO2PILOT

    Participants: Pierre-Alexandre Mattei​​​‌.

    The project CO2PILOT‌ aspires to create the‌​‌ go-to solution for reliable​​ CO₂ emissions monitoring and​​​‌ data analysis for industrial‌ sites. The consortium is‌​‌ composed of QAIrbon (Leader),​​ Inria Maasai, APAVE, ACRI-ST,​​​‌ and the Laboratoire Atmosphères,‌ Milieux, Observations Spatiales (LATMOS).‌​‌ Grant amount: 233k€ for​​ Maasai.

9 Partnerships and​​​‌ cooperations

9.1 International initiatives‌

The Maasai team has‌​‌ informal relationships with the​​ following international teams:

  • School​​​‌ of Mathematics and Statistics,‌ University College Dublin (Ireland)‌​‌ through the collaborations with​​ Brendan Murphy, Riccardo Rastelli​​​‌ and Michael Fop,
  • Université‌ Laval, Québec (Canada) through‌​‌ the Research Program DEEL​​ (DEpendable and Explainable Learning)​​​‌ with Christian Gagné, and‌ through Arnaud Droit,
  • DTU‌​‌ Compute, Technical University of​​ Denmark, Copenhagen (Denmark), through​​​‌ collaborations with Jes Frellsen‌ and his team (including‌​‌ the co-supervision of a​​ PhD student in Denmark:​​​‌ Hugo Sénétaire, who defended‌ in 2025),
  • Department of‌​‌ Statistics of the University​​ of Washington, Seattle (USA)​​​‌ through collaborations with Elena‌ Erosheva and Adrian Raftery,‌​‌
  • SAILAB team at Università​​ di Siena, Siena (Italy)​​​‌ through collaborations with Marco‌ Gori.

9.2 International research‌​‌ visitors

9.2.1 Visits of​​ international scientists

Inria International​​​‌ Chair
  • Pr. Arnaud Droit,‌ Université Laval, Canada (Inria‌​‌ Int. Chair)
Other international​​​‌ visits to the team​
  • Jes Frellsen (Technical University of Denmark) and Jakub Tomczak (Chan Zuckerberg Initiative) visited the team while co-organizing the GeMSS/Statlearn 2025 spring school, held at Inria in April 2025.
  • Antonio Canale (University of​ Padova) visited the team​‌ in May 2025.

9.2.2​​ Visits to international teams​​​‌

Research stays abroad
  • Mariam Grigoryan stayed for 3 months as a visiting scholar at the Broad Institute of MIT and Harvard in March-July 2025.
  • Pierre-Alexandre Mattei made three week-long visits to Jes Frellsen's team at DTU.
  • Frédéric Precioso visited the team of Christian Gagné at Université Laval, Québec, Canada, in November 2025.

9.3 National initiatives​‌

IA Cluster "Institut 3IA​​ Côte d'Azur"

Participants: Charles​​​‌ Bouveyron, Pierre-Alexandre Mattei​, Vincent Vandewalle.​‌

Following the call of President Macron to found several national institutes in AI, we presented our project for the Institut 3IA Côte d'Azur to an international jury in April 2019. The project was selected for funding (50 M€ for the first 4 years, including 16 M€ from the PIA program) and started in September 2019. Charles Bouveyron was among the 29 3IA chairs selected ab initio by the international jury; Pierre-Alexandre Mattei was awarded a 3IA chair in 2021, and Vincent Vandewalle in 2022. Charles Bouveyron was the Director of the institute from January 2021 until October 2025, after serving as Deputy Scientific Director in 2019-2020. The research of the institute is organized around 4 thematic axes: Core elements of AI, Computational Medicine, AI for Biology and Smart territories. The Maasai research team is fully aligned with the first axis of the Institut 3IA Côte d'Azur and also contributes to the 3 other axes through applied collaborations. The team has several Ph.D. students and postdocs who are directly funded by the institute. The institute was renewed in 2024 for an additional period of 5 years, under the IA Cluster label.

Web site: 3ia.univ-cotedazur.eu

ANR​​​‌ Project MultiTrans

Participants: Diane​ Lingrand, Frederic Precioso​‌, Remy Sun.​​

Partners:

  • Valeo.ai
  • INSA Rouen​​​‌

In the MultiTrans project, we propose to tackle the development and deployment of autonomous driving algorithms jointly. The idea is to make data, experience and knowledge transferable across the different systems (simulation, robotic models, and real-world cars), thus potentially accelerating the rate at which an embedded intelligent system can gradually learn to operate at each deployment stage. Existing autonomous vehicles are able to learn how to react and operate in known domains autonomously, but research is needed to help these systems during the perception stage, allowing them to be operational and safer in a wider range of situations. MultiTrans proposes to address this issue by developing an intermediate environment in which algorithms can be deployed in a physical world model, re-creating more realistic use cases that would contribute to a better and faster transfer of perception algorithms to and from a real autonomous vehicle test-bed and between multiple domains.

Web site: anr-multitrans.github.io‌

ANR Project FATE-MLOps

Participants:‌​‌ Frederic Precioso.

Partners:​​

  • IRIT Toulouse
  • I3S Sophia​​​‌ Antipolis
  • McMaster, Ontario, Canada‌

The MLOps movement adopts the DevOps objective of reducing the gaps between development and operations teams by integrating data scientist teams and Machine Learning (ML) models. In this project, we wish to apply and adapt good software engineering practices to strengthen both the overall quality of the ML model construction processes and the quality of the software systems produced, particularly in terms of extra-functional properties that will become crucial issues: Fairness, Accountability, Transparency, Ethics, and Security (FATES). The project focuses on the study, formalization, measurement, and management of these properties throughout the continuous MLOps process. Indeed, more than traditional Key Performance Indicators (KPIs), such as precision and recall, is required to evaluate the robustness of models in practical applications. Our project aims to study the FATES properties and, by refining proven software engineering concepts and tools, propose a systematic and tailored approach for considering those properties, particularly through the lens of ML Scientists or ML Engineers, throughout the lifecycle of the software developed following an MLOps approach.

Web site:​​ fates-mlops.org

ANR Project PROFILE​​​‌

Participants: Frederic Precioso.‌

Partners:

  • INRIA Lille
  • I3S‌​‌ Sophia Antipolis

In this project, we adopt a Software Engineering (SE) approach to the problem of characterizing ML workflows from notebooks at a level of abstraction that allows code checking and code sharing. We propose to link model-driven engineering (MDE, here “model” in the SE sense) with statistical and static analyses to characterize these ML workflows by models (also in the SE sense), which we call PROFILES hereafter.

More specifically, our​​​‌ project aims to explore‌ three complementary questions: (Q1)‌​‌ What information can and​​ should be automatically extracted​​​‌ from a Notebook to‌ build a profile for‌​‌ its analysis? (Q2) Is​​ it possible to systematically​​​‌ identify typical errors from‌ the profile (for example,‌​‌ the use of functions​​ unsuited to the problem​​​‌ at hand) and to‌ identify bad practices (for‌​‌ example, the use of​​ test data for training)?​​​‌ (Q3) Can we exploit‌ the profusion of Notebooks‌​‌ to accelerate ML research​​ by encouraging, on the​​​‌ basis of extracted PROFILES,‌ a pooling of knowledge‌​‌ and the elicitation of​​ new good/bad practices? It​​​‌ is important to note‌ that in this project,‌​‌ we are talking about​​ verifying quality rules in​​​‌ the way a code‌ linter would, and not‌​‌ in the sense of​​ formally verifying properties.

ANR PEPR NumPEx project HPC-Sage

Participants: Diane Lingrand, Remy Sun, Pierre-Alexandre Mattei, Frederic Precioso.

Partners:

  • Université de‌ Strasbourg (UniStra)
  • Inria Callisto‌​‌ (Sophia), Acumes (Sophia) and​​ Makutu (Pau)

The SAGE-HPC project aims to develop a scalable, open, and interoperable software platform for multi-fidelity optimization of complex physical problems, targeting exascale high-performance computing (HPC) environments. Solving this type of optimization problem represents a major scientific challenge due to the complexity of the physical phenomena modeled and the computational cost associated with high-fidelity simulations. To overcome this difficulty, the project relies both on the coordinated use of variable-fidelity models (simplified, low-cost models guide the exploration of the solution space, while high-fidelity models are used in a targeted manner to refine the results) and on the massive exploitation of exascale HPC resources, enabling these approaches to be run in parallel at large scale.

More specifically, we​‌ propose to tackle this​​ question through the study​​​‌ of deep learning techniques​ for these exascale problems​‌ and the collection of​​ meta-learning insights on the​​​‌ process of training neural​ networks across different types​‌ of physical problems.

9.4​​ Regional initiatives

Centre de​​​‌ pharmacovigilance, CHU Nice

Participants: Charles Bouveyron, Alexandre Destere, Marco Corneli, Michel Riveill.

Collaborators: Milou-Daniel Drici

The team works very closely with the Regional Pharmacovigilance Center of the University Hospital Center of Nice (CHU) through several projects. The first project focuses on the construction of a dashboard to classify spontaneous patient and professional reports and, above all, to report temporal breaks. To this end, we are studying the use of dynamic co-classification techniques to both detect significant adverse drug reaction (ADR) patterns and identify temporal breaks in the dynamics of the phenomenon. The second project focuses on the analysis of medical reports in order to extract, when present, the adverse events for characterization. After studying a supervised approach, we are now investigating techniques that require fewer annotations.

10 Dissemination

10.1​​​‌ Promoting scientific activities

10.1.1​ Scientific events: organization

  • Charles Bouveyron, Marco Corneli, Pierre-Alexandre Mattei, Remy Sun, and Vincent Vandewalle were local organizers of the 32nd Summer Working Group on Model-Based Clustering, held at Inria in July 2025.
  • Diane Lingrand and​​ Pierre-Alexandre Mattei were part​​​‌ of the scientific council​ of the SophI.A Summit​‌ 2025. Pierre-Alexandre Mattei​​ was head of the​​​‌ council.
  • Charles Bouveyron, Marco Corneli, and Pierre-Alexandre Mattei were co-organizers of the GeMSS/Statlearn 2025 spring school, held at Inria in April 2025.
  • Pierre-Alexandre Mattei was​​ a co-organizer of the​​​‌ workshop GenU 2025.​
  • Pierre-Alexandre Mattei was a​‌ co-organizer of a joint​​ day of seminars between​​​‌ the Probability and Statistics​ team from the LJAD​‌ and Maasai.
  • Pierre-Alexandre Mattei​​ was a co-organizer of​​​‌ the first edition of​ EurIPS.
  • Frederic Precioso, Remy Sun and Diane Lingrand were organizers of the Deep Learning School held at SophiaTech in June and July 2025.

10.1.2 Scientific events:​​​‌ selection

  • Pierre-Alexandre Mattei is​ an area chair for​‌ the conferences NeurIPS and​​ ICML.
  • Most members of​​​‌ the team regularly review​ papers for major ML/CV​‌ conferences.

10.1.3 Journal

  • Charles Bouveyron is an associate editor of the Annals of Applied Statistics.
  • Most members​‌ of the team regularly​​ review papers for ML/CV/stats​​​‌ journals.

10.1.4 Invited talks​

10.1.5​​ Leadership within the scientific​​​‌ community

  • Charles Bouveyron was the Director of the Institut 3IA Côte d'Azur from January 2021 to October 2025.
  • Vincent Vandewalle has been the Deputy Scientific Director of the EFELIA Côte d'Azur education program since September 2022.

10.1.6 Scientific expertise

  • Charles Bouveyron is a member of the Scientific Orientation Council of Centre Antoine Lacassagne, Unicancer center of Nice.

10.1.7 Research administration

  • Charles Bouveyron administered the Institut 3IA Côte d'Azur as director (20 M€ over 5 years).

10.2 Teaching​​ - Supervision - Juries​​​‌ - Educational and pedagogical‌ outreach

10.2.1 Supervision

The team has 5 senior researchers with an HDR who are able to supervise Ph.D. students. Usually, the supervision of a Ph.D. student of the team is shared between a senior and a junior researcher of the team. The following Maasai Ph.D. students defended in 2025:

  • Hugo Senetaire (co-supervised by Pierre-Alexandre Mattei and Jes Frellsen) defended in May 2025 at the Technical University of Denmark. The committee was composed of Georgios Arvanitidis, Lars Kai Hansen, Thomas Schön, and Wojciech Samek.
  • Kilian Burgi (co-supervised by​​ Charles Bouveyron and Cécile​​​‌ Sabourault), Detection and monitoring‌ of marine biodiversity by‌​‌ artificial intelligence.

10.2.2 Juries​​

All senior members of the team are actively involved in the supervision of postdocs, Ph.D. students and interns, and frequently participate in Ph.D. and HDR defenses.

10.3 Popularization

  • Charles Bouveyron, Kilian Burgi, Remy Sun and Diane Lingrand have released the web platform CleverFish, which aims to bridge the gap between marine biology and AI. CleverFish tackles three core challenges of an efficient management tool: (i) providing an easy-to-use graphical user interface, (ii) accommodating global and video-specific in-app biodiversity assessment, and (iii) allowing fast and efficient extraction of temporal and spatial fish species distribution in a format understandable for ecologists.
  • Charles Bouveyron, Frederic Precioso and Vincent Vandewalle participated in a series of TV documentaries on Artificial Intelligence for TV Monaco and TV5 Monde.
  • Frederic Precioso gave a full-day tutorial on "Building your own LLM from Scratch" (part 1 & part 2).

11 Scientific production

11.1‌​‌ Major publications

11.2​​ Publications of the year​​​‌

International journals

Invited conferences​‌

  • 38 inproceedings Christophe Biernacki and Vincent Vandewalle. An EM Stopping Rule for Avoiding Degeneracy in Gaussian-Based Clustering with Missing Data. Journée PS-MAASAI 2025, Nice, France, April 2025.
  • 39 inproceedings Vincent Vandewalle, Marie Du Roy de Chaumaray and Matthieu Marbac. Non-parametric Multi-Partitions Clustering. Studies in Classification, Data Analysis, and Knowledge Organization, CLADAG 2025 - 15th Scientific Meeting of the CLAssification and Data Analysis Group, Naples, Italy, September 2025.
  • 40 inproceedings Vincent Vandewalle, Matthieu Marbac and Marie Du Roy de Chaumaray. Multiple partition clustering. Working Group on Model Based Clustering 2025, Valbonne Sophia-Antipolis, France, July 2025.
  • 41 inproceedings Manon Vuillien, Marco Corneli, Davide Adamo, Thierry Argant, Jwana Chahoud, Karyne Debue, Camille Lamarque, Marjan Mashkour, Tim Mibord, Michaël Seigle and Emmanuelle Vila. Rencontre entre l’archéozoologie et les mathématiques : vers de nouvelles approches en anatomie comparée computationnelle. IA et innovations numériques : usages et enjeux en archéologie, Chartres, France, November 2025.
  • 42 inproceedings Manon Vuillien, Marco Corneli, Davide Adamo and Emmanuelle Vila. Zooarchaeology & machine learning: a promising matching. Prospective Biodiversité et Intelligence Artificielle, Paris, France, October 2025.

International​ peer-reviewed conferences

Conferences without proceedings

Scientific book​​ chapters

  • 58 inbook Alessandro Betti, Michele Casoni and Marco Gori. Forward Approximate Solution for Linear Quadratic Tracking. In Advanced Neural Artificial Intelligence: Theories and Applications, Smart Innovation, Systems and Technologies, vol. 428, Springer Nature Singapore, 2025, pp. 55-69.

Edition (books, proceedings, special​​ issue of a journal)​​​‌

  • 59 proceedings Isabelle Théry-Parisot, Vanna Lisa Coli, Luca Calatroni and Marco Corneli, eds. Special issue: CULHER_Advances in artificial intelligence and quantitative methods for archaeology and art history. IAMAHA 2023 - first international conference on artificIAl Intelligence and applied MAthematics for History and Archaeology, Nice, France. Journal of Cultural Heritage, special issue, vol. 75, Elsevier, 2025.

Reports &​​​‌ preprints

Other scientific​​​‌ publications