THOTH - 2025

2025 Activity Report, Project-Team THOTH

RNSR: 201622034K

Creation of the​‌ Project-Team: 2016 March 01​​

Each year, Inria research​​​‌ teams publish an Activity​ Report presenting their work​‌ and results over the​​ reporting period. These reports​​​‌ follow a common structure,​ with some optional sections​‌ depending on the specific​​ team. They typically begin​​​‌ by outlining the overall​ objectives and research programme,​‌ including the main research​​ themes, goals, and methodological​​​‌ approaches. They also describe​ the application domains targeted​‌ by the team, highlighting​​ the scientific or societal​​​‌ contexts in which their​ work is situated.

The​‌ reports then present the​​ highlights of the year,​​​‌ covering major scientific achievements,​ software developments, or teaching​‌ contributions. When relevant, they​​ include sections on software,​​​‌ platforms, and open data,​ detailing the tools developed​‌ and how they are​​ shared. A substantial part​​​‌ is dedicated to new​ results, where scientific contributions​‌ are described in detail,​​ often with subsections specifying​​​‌ participants and associated keywords.​

Finally, the Activity Report​‌ addresses funding, contracts, partnerships,​​ and collaborations at various​​​‌ levels, from industrial agreements​ to international cooperations. It​‌ also covers dissemination and​​ teaching activities, such as​​​‌ participation in scientific events,​ outreach, and supervision. The​‌ document concludes with a​​ presentation of scientific production,​​​‌ including major publications and​ those produced during the​‌ year.

Keywords

Computer Science​​ and Digital Science

  • A3.4.​​​‌ Machine learning and statistics​
  • A5.3. Image processing and​‌ analysis
  • A5.9. Signal processing​​
  • A6.2.6. Optimization
  • A8.2. Optimization​​​‌
  • A9.2. Machine learning
  • A9.3.​ Signal processing
  • A9.7. AI​‌ algorithmics
  • A9.11. Generative AI​​
  • A9.12. Computer vision

1 Team members, visitors,​​ external collaborators

Research Scientists​​​‌

  • Julien Mairal [Team leader, Inria, Senior Researcher, on secondment from the Corps des Mines, HDR]
  • Karteek Alahari [Inria‌, Senior Researcher,‌​‌ HDR]
  • Michael Arbel​​ [Inria, Researcher​​​‌]
  • Pia Bideau [‌UGA, Chair]‌​‌
  • Jocelyn Chanussot [Inria, Senior Researcher, on secondment from Grenoble INP, HDR]
  • Emanuele‌​‌ Dalsasso [Inria,​​ ISFP, from Dec​​​‌ 2025]
  • Pierre Gaillard‌ [Inria, Researcher‌​‌, HDR]
  • Hadrien​​ Hendrikx [Inria,​​​‌ Researcher]

Post-Doctoral Fellows‌

  • Alessia Boccalatte [UGA‌​‌, Post-Doctoral Fellow,​​ until Jan 2025]​​​‌
  • Khaled Eldowa [Inria‌, Post-Doctoral Fellow,‌​‌ from Oct 2025]​​
  • Charles-Gerard Lucas [Inria​​​‌, Post-Doctoral Fellow,‌ from Oct 2025]‌​‌
  • Giacomo Meanti [Inria​​, Post-Doctoral Fellow]​​​‌
  • Romain Menegaux [Inria‌, until Jan 2025‌​‌]
  • Scott Pesme [​​Inria, Post-Doctoral Fellow​​​‌]

PhD Students

  • Yedidia‌ Agnimo [Ekimetrics,‌​‌ CIFRE, from Jul​​ 2025]
  • Loic Arbez​​​‌ [GRENOBLE INP]‌
  • Eyal Benaroche [Meta‌​‌, from Nov 2025​​]
  • Tariq Berrada Ifriqi​​​‌ [Meta, CIFRE‌]
  • Theo Bodrito [‌​‌Inria, until Jul​​ 2025, with Willow​​​‌]
  • Timothee Darcet [‌Meta, CIFRE,‌​‌ until Feb 2025]​​
  • Fares El Khoury [​​​‌Inria]
  • Renaud Gaucher‌ [Ecole Polytechnique]‌​‌
  • Bilal Yagiz Gündeger [​​UGA, from Nov​​​‌ 2025]
  • Vincent Herfeld‌ [ENHANCE LAB,‌​‌ CIFRE, from May​​ 2025]
  • Emmanuel Jehanno​​​‌ [Inria]
  • Zhiqi‌ Kang [Inria,‌​‌ until Sep 2025]​​
  • Paul Liautaud [Sorbonne​​​‌ Univ]
  • Bianca Marin‌ Moreno [EDF,‌​‌ CIFRE, until Nov​​ 2025]
  • Juliette Marrie​​​‌ [NAVER LABS Europe‌, CIFRE, until‌​‌ Jul 2025]
  • Ieva​​ Petrulionyte [UGA]​​​‌
  • François Porcher [Meta‌, CIFRE, from‌​‌ Apr 2025, with​​ WILLOW]
  • Colin Prieur​​​‌ [Univ Montpellier,‌ until Oct 2025]‌​‌
  • Romain Seailles [ENS​​ Paris, with Willow​​​‌]
  • Amogh Tiwari [‌UGA]
  • Eloise Touron‌​‌ [Inria]
  • Kenta​​ Vert [UGA,​​​‌ from Sep 2025]‌
  • Julien Zhou [Criteo‌​‌, CIFRE]

Technical​​ Staff

  • Juliette Bertrand [​​​‌Inria, Engineer,‌ until Oct 2025]‌​‌
  • Julien Horvat [UGA​​]
  • Noé Peterlongo [​​​‌INPG SA, Engineer‌, until Jan 2025‌​‌]
  • Thomas Ryckeboer [​​Inria, Engineer]​​​‌
  • Mathis Tailland [UGA‌, until Oct 2025‌​‌]

Interns and Apprentices​​

  • Theodore Batte [Polytech​​​‌ Grenoble, Intern,‌ until Mar 2025]‌​‌
  • Augustin Cablant [Criteo​​, Intern, from​​​‌ May 2025 until Nov‌ 2025]
  • Romain Forestier‌​‌ [UGA, Intern​​, from May 2025​​​‌ until Jun 2025]‌
  • Manuela Giraldo Obando [‌​‌INPG SA, Intern​​, from Jun 2025​​​‌ until Jul 2025]‌
  • Quentin Goizet [Polytech‌​‌ Grenoble, Intern]​​
  • Geraud Ilinca [Inria​​​‌, Intern, from‌ Mar 2025 until Sep‌​‌ 2025]
  • Lucas Montigon​​ [Polytech Grenoble,​​​‌ Intern, until Mar‌ 2025]
  • Carlos Inaki‌​‌ Roman Martinez [UGA​​​‌, Intern, from​ Feb 2025 until Jul​‌ 2025]
  • Morgan Scalabrino​​ [Inria, Intern​​​‌, from Apr 2025​ until Aug 2025]​‌
  • Kenta Vert [Inria​​, Intern, from​​​‌ Apr 2025 until Aug​ 2025]

Administrative Assistant​‌

  • Nathalie Gillot [Inria​​]

Visiting Scientists

  • Nassim​​​‌ Ait Ali Braham [​DLR, until Oct​‌ 2025]
  • Yusuf Mehmet​​ Colak [UNIV PAVIE​​​‌, from Oct 2025​]
  • François Postic [​‌INRAE, until Sep​​ 2025]
  • Francesca Razzano​​​‌ [Univ Padova,​ from Oct 2025]​‌

External Collaborator

  • Olivier Flasseur​​ [CNRS]

2​​​‌ Overall objectives

Thoth is a computer vision and machine learning team. Our initial goal was to develop machine learning models for analyzing the massive amounts of visual data that are currently available on the web. Since then, the focus of the team has become more diverse. More precisely, we share a common objective of developing machine learning models that are robust and efficient (in terms of computational cost and data requirements).

Our main​‌ research directions are the​​ following ones:

  • visual understanding​​​‌ from limited annotations and​ data: Many state-of-the-art computer​‌ vision models are typically​​ trained on a huge​​​‌ corpus of fully annotated​ data. We want to​‌ reduce the cost by​​ developing new algorithms for​​​‌ unsupervised, self-supervised, continual, or​ incremental learning.
  • efficient deep​‌ learning models, from theory​​ to applications: We​​​‌ want to invent a​ new generation of machine​‌ learning models (in particular​​ deep learning) with theoretical​​​‌ guarantees, efficient algorithms, and​ a wide range of​‌ applications. We develop for​​ instance models for images,​​​‌ videos, graphs, or sequences.​
  • statistical machine learning and​‌ optimization: we are also​​ developing efficient machine learning​​​‌ methods, with a focus​ on stochastic optimization for​‌ processing large-scale data, and​​ online learning.
  • pluri-disciplinary collaborations: Machine learning being at the crossroads of several disciplines, we have successfully conducted collaborations in scientific domains that are relatively far from our domains of expertise. These fields are producing massive amounts of data and are in dire need of efficient tools to make predictions or interpretations. For example, we have had the chance to collaborate with many colleagues from natural language processing, robotics, neuroimaging, computational biology, genomics, and astrophysics (for exoplanet detection), and we are currently involved in several remote sensing and hyperspectral imaging projects thanks to Jocelyn Chanussot (hosted by Thoth in the 2019 to 2022 period, now an Inria senior scientist on leave from Grenoble INP since September 2023).

3 Research program

3.1​ Designing and learning structured​‌ models

The task of understanding image and video content has been interpreted in several ways over the past few decades, namely image classification, detecting objects in a scene, recognizing objects and their spatial extents in an image, and recovering scene geometry. However, addressing all these problems individually provides us with a partial understanding of the scene at best, leaving much of the visual data unexplained.

One of the main goals of this research axis is to go beyond the initial attempts that consider only a subset of tasks jointly, by developing novel models for a more complete understanding of scenes to address all the component tasks. We propose to incorporate the structure in image and video data explicitly into the models. In other words, our models aim to satisfy the complex sets of constraints that exist in natural images and videos. Examples of such constraints include: (i) relations between objects, e.g., signs for shops indicate the presence of buildings, (ii) higher-level semantic relations involving the type of scene, geographic location, and the plausible actions as a global constraint, e.g., an image taken at a swimming pool is unlikely to contain cars, (iii) relating objects occluded in some of the video frames to content in other frames, where they are more clearly visible as the camera or the object itself moves, with the use of long-term trajectories and video object proposals.

This research axis will focus on two topics. The first is developing deep features for video. This involves designing rich features available in the form of long-range temporal interactions among pixels in a video sequence to learn a representation that is truly spatio-temporal in nature. The second topic is aimed at learning models that capture the relationships among several objects and regions in a single image scene, and additionally, among scenes in the case of an image collection or a video. The main scientific challenges in this topic stem from learning the structure of the probabilistic graphical model as well as the parameters of the cost functions quantifying the relationships among its entities. In the following, we present work related to both topics and then elaborate on our research directions.

  • Deep features​​​‌ for vision. Deep learning‌ models provide a rich‌​‌ representation of complex objects​​ but in return have​​​‌ a large number of‌ parameters. Thus, to work‌​‌ well on difficult tasks,​​ a large amount of​​​‌ data is required. In‌ this context, video presents‌​‌ several advantages: objects are​​ observed from a large​​​‌ range of viewpoints, motion‌ information allows the extraction‌​‌ of moving objects and​​ parts, and objects can​​​‌ be differentiated by their‌ motion patterns. We initially‌​‌ plan to develop deep​​ features for videos that​​​‌ incorporate temporal information at‌ multiple scales. We then‌​‌ plan to further exploit​​ the rich content in​​​‌ video by incorporating additional‌ cues such as minimal‌​‌ prior knowledge of the​​ object of interest, with​​​‌ the goal of learning‌ a representation that is‌​‌ more appropriate for video​​ understanding. In other words,​​​‌ a representation that is‌ learned from video data‌​‌ and targeted at specific​​ applications.
  • Structured models. The interactions among various elements in a scene, such as the objects and regions in it, the motion of object parts or entire objects themselves, form a key element for understanding image or video content. These rich cues define the structure of visual data and how it evolves spatio-temporally. We plan to develop a novel graphical model to exploit this structure. The main components in this graphical model are spatio-temporal regions (in the case of video) or simply image regions, which can represent object parts or entire objects themselves, and the interactions among several entities. The dependencies among the scene entities are defined with a higher-order or a global cost function. A higher-order constraint is a generalization of the pairwise interaction term, and is a cost function involving more than two components in the scene, e.g., several regions, whereas a global constraint imposes a cost term over the entire image or video, such as prior knowledge of the number of people expected in the scene. The constraints we plan to include generalize several existing methods, which are limited to pairwise interactions or a small restrictive set of higher-order costs. In addition to learning the parameters of these novel functions, we will focus on learning the structure of the graph itself, a challenging problem that is seldom addressed in current approaches. This provides an elegant way to go beyond state-of-the-art deep learning methods, which are limited to learning the high-level interaction among parts of an object, by learning the relationships among objects.
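As a rough formalization of the structured-model item above, the cost over a labeling of the scene regions could be sketched as follows. The notation is ours and purely illustrative (it does not appear in the report): x_i is the label of region i, V the set of regions, E the set of interacting pairs, and C a collection of region groups with more than two members. The unary term is a standard ingredient that we add as an assumption.

```latex
E(\mathbf{x}) \;=\;
  \underbrace{\sum_{i \in \mathcal{V}} \phi_i(x_i)}_{\text{unary (assumed)}}
  \;+\; \underbrace{\sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j)}_{\text{pairwise}}
  \;+\; \underbrace{\sum_{c \in \mathcal{C},\, |c| > 2} \psi_c(\mathbf{x}_c)}_{\text{higher-order}}
  \;+\; \underbrace{\psi_{\mathrm{global}}(\mathbf{x})}_{\text{global}}
```

In this reading, learning the parameters means fitting the functions φ and ψ, whereas learning the structure means choosing the sets E and C themselves.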

3.2 Learning of visual​ models from minimal supervision​‌

Today's approaches to visual recognition learn models for a limited and fixed set of visual categories with fully supervised classification techniques. This paradigm was adopted in the early 2000s, and within it enormous progress has been made over the last decade.

The scale and diversity in today's large and growing image and video collections (such as broadcast archives and personal image/video collections) call for a departure from the current paradigm. This is because, to answer queries about such data, it is infeasible to learn the models of visual content by manually and precisely annotating every relevant concept, object, scene, or action category in a representative sample of everyday conditions. For one, it will be difficult, or even impossible, to decide a priori what the relevant categories and the proper granularity level are. Moreover, the cost of such annotations would be prohibitive in most application scenarios. One of the main goals of the Thoth project-team is to develop a new framework for learning visual recognition models by actively exploring large digital image and video sources (off-line archives as well as growing on-line content), and exploiting the weak supervisory signal provided by the accompanying metadata (such as captions, keywords, tags, subtitles, or scripts) and audio signal (from which we can for example extract speech transcripts, or exploit speaker recognition models).

Textual metadata has​‌ traditionally been used to​​ index and search for​​​‌ visual content. The information​ in metadata is, however,​‌ typically sparse (e.g., the​​ location and overall topic​​​‌ of newscasts in a​ video archive 1)​‌ and noisy (e.g., a​​ movie script may tell​​ us that two persons​​​‌ kiss in some scene,‌ but not when, and‌​‌ the kiss may occur​​ off the screen or​​​‌ not have survived the‌ final cut). For this‌​‌ reason, metadata search should​​ be complemented by visual​​​‌ content based search, where‌ visual recognition models are‌​‌ used to localize content​​ of interest that is​​​‌ not mentioned in the‌ metadata, to increase the‌​‌ usability and value of​​ image/video archives. The key​​​‌ insight that we build‌ on in this research‌​‌ axis is that while​​ the metadata for a​​​‌ single image or video‌ is too sparse and‌​‌ noisy to rely on​​ for search, the metadata​​​‌ associated with large video‌ and image databases collectively‌​‌ provide an extremely versatile​​ source of information to​​​‌ learn visual recognition models.‌ This form of “embedded‌​‌ annotation” is rich, diverse​​ and abundantly available. Mining​​​‌ these correspondences from the‌ web, TV and film‌​‌ archives, and online consumer​​ generated content sites such​​​‌ as Flickr, Facebook, or‌ YouTube, guarantees that the‌​‌ learned models are representative​​ for many different situations,​​​‌ unlike models learned from‌ manually collected fully supervised‌​‌ training data sets which​​ are often biased.

The approach we propose to address the limitations of the fully supervised learning paradigm aligns with “Big Data” approaches developed in other areas: we rely on the orders-of-magnitude-larger training sets that have recently become available with metadata to compensate for less explicit forms of supervision. This will form a sustainable approach to learn visual recognition models for a much larger set of categories with little or no manual intervention. Reducing and ultimately removing the dependency on manual annotations will dramatically reduce the cost of learning visual recognition models. This in turn will allow such models to be used in many more applications, and enable new applications based on visual recognition beyond a fixed set of categories, such as natural language based querying for visual content. This is an ambitious goal, given the sheer volume and intrinsic variability of the everyday visual content available on-line, and the lack of a universally accepted formalism for modeling it. Yet, the potential payoff is a breakthrough in visual object recognition and scene understanding capabilities.

This research​​ axis is organized into​​​‌ the following three sub-tasks:‌

  • Weakly supervised learning. For‌​‌ object localization we will​​ go beyond current methods​​​‌ that learn one category‌ model at a time‌​‌ and develop methods that​​ learn models for different​​​‌ categories concurrently. This allows‌ “explaining away” effects to‌​‌ be leveraged, i.e., if​​ a certain region in​​​‌ an image has been‌ identified as an instance‌​‌ of one category, it​​ cannot be an instance​​​‌ of another category at‌ the same time. For‌​‌ weakly supervised detection in​​ video we will consider​​​‌ detection proposal methods. While‌ these are effective for‌​‌ still images, recent approaches​​ for the spatio-temporal domain​​​‌ need further improvements to‌ be similarly effective. Furthermore,‌​‌ we will exploit appearance​​ and motion information jointly​​​‌ over a set of‌ videos. In the video‌​‌ domain we will also​​​‌ continue to work on​ learning recognition models from​‌ subtitle and script information.​​ The basis of leveraging​​​‌ the script data which​ does not have a​‌ temporal alignment with the​​ video is to use​​​‌ matches in the narrative​ in the script and​‌ the subtitles (which do​​ have a temporal alignment​​​‌ with the video). We​ will go beyond simple​‌ correspondences between names and​​ verbs relating to self-motion,​​​‌ and match more complex​ sentences related to interaction​‌ with objects and other​​ people. To deal with​​​‌ the limited number of​ occurrences of such actions​‌ in a single movie,​​ we will consider approaches​​​‌ that learn action models​ across a collection of​‌ movies.
  • Online learning of visual models. As a larger number of visual category models is being learned, online learning methods become important, since new training data and categories will arrive over time. We will develop online learning methods that can incorporate new examples for existing category models, and learn new category models from few examples by leveraging similarity to related categories using multi-task learning methods. Here we will develop new distance-based classifiers and attribute and label embedding techniques, and explore the use of NLP techniques such as skipgram models to automatically determine between which classes transfer should occur. Moreover, NLP will be useful in the context of learning models for many categories to identify synonyms, to determine cases of polysemy (e.g., jaguar car brand vs. jaguar animal), and to merge or refine categories accordingly. Ultimately this will result in methods that are able to learn an “encyclopedia” of visual models.
  • Visual search from unstructured textual queries. We will build on recent approaches that learn recognition models on-the-fly (as the query is issued) from generic image search engines such as Google Images. While it is feasible to learn models in this manner in a matter of seconds, it is challenging to use the model to retrieve relevant content in real-time from large video archives of more than a few thousand hours. Achieving this requires feature compression techniques to store visual representations in memory, and cascaded search techniques to avoid exhaustive search. This approach, however, leaves untouched the core problem of how to associate visual material with the textual query in the first place. The second approach we will explore is based on image annotation models. In particular we will go beyond image-text retrieval methods by using recurrent neural networks such as Elman networks or long short-term memory (LSTM) networks to generate natural language sentences to describe images.
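To make the idea of a distance-based classifier with online updates more concrete, here is a minimal sketch of a nearest-class-mean classifier. It is only an illustration of the general technique, not the team's method: the class name, method names, and toy 2-D features are all hypothetical. Each category is summarized by the running mean of its feature vectors, so adding an example or a whole new category does not require retraining.

```python
import math


class NearestClassMean:
    """Toy distance-based classifier with online updates (illustrative only)."""

    def __init__(self):
        self.sums = {}    # label -> running sum of feature vectors
        self.counts = {}  # label -> number of examples seen so far

    def partial_fit(self, x, label):
        """Incorporate one example; an unseen label creates a new category."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(x)
            self.counts[label] = 0
        self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
        self.counts[label] += 1

    def class_mean(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, x):
        """Return the label whose class mean is closest in Euclidean distance."""
        return min(self.sums, key=lambda lab: math.dist(x, self.class_mean(lab)))


clf = NearestClassMean()
clf.partial_fit([0.0, 0.0], "cat")
clf.partial_fit([0.2, 0.1], "cat")
clf.partial_fit([5.0, 5.0], "car")
first = clf.predict([0.1, 0.0])

# A new category can be added later from a single example.
clf.partial_fit([-4.0, 6.0], "bird")
second = clf.predict([-3.5, 5.5])
print(first, second)
```

In practice the feature vectors would come from a learned visual representation, and the Euclidean distance could be replaced by a learned metric.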

3.3 Large-scale​​ learning and optimization

We​​​‌ have entered an era​ of massive data acquisition,​‌ leading to the revival​​ of an old scientific​​​‌ utopia: it should be​ possible to better understand​‌ the world by automatically​​ converting data into knowledge.​​​‌ It is also leading​ to a new economic​‌ paradigm, where data is​​ a valuable asset and​​​‌ a source of activity.​ Therefore, developing scalable technology​‌ to make sense of​​ massive data has become​​ a strategic issue. Computer​​​‌ vision has already started‌ to adapt to these‌​‌ changes.

In particular, very high-dimensional models such as deep networks are becoming highly popular and successful for visual recognition. This change is closely related to the advent of big data. On the one hand, these models involve a huge number of parameters and are rich enough to represent complex objects such as natural images or text corpora well. On the other hand, they are prone to overfitting (fitting too closely to training data without being able to generalize to new unseen data) despite regularization; to work well on difficult tasks, they require a large amount of labeled data that has been available only recently. Other cues may explain their success: the deep learning community has made significant engineering efforts, making it possible to train on a GPU, in a day, large models that would have required weeks of computation on a traditional CPU, and it has accumulated enough empirical experience to find good hyper-parameters for its networks.

Learning the huge number of parameters of deep hierarchical models requires scalable optimization techniques and large amounts of data to prevent overfitting. This immediately raises two major challenges: how to learn without large amounts of labeled data, or with weakly supervised annotations? How to efficiently learn such high-dimensional models? To answer the above challenges, we will concentrate on the design and theoretical justifications of deep architectures including our recently proposed deep kernel machines, with a focus on weakly supervised and unsupervised learning, and develop continuous and discrete optimization techniques that push the state of the art in terms of speed and scalability.

This research axis will​​​‌ be developed into three‌ sub-tasks:

  • Deep kernel machines‌​‌ for structured data. Deep​​ kernel machines combine advantages​​​‌ of kernel methods and‌ deep learning. Both approaches‌​‌ rely on high-dimensional models.​​ Kernels implicitly operate in​​​‌ a space of possibly‌ infinite dimension, whereas deep‌​‌ networks explicitly construct high-dimensional​​ nonlinear data representations. Yet,​​​‌ these approaches are complementary:‌ Kernels can be built‌​‌ with deep learning principles​​ such as hierarchies and​​​‌ convolutions, and approximated by‌ multilayer neural networks. Furthermore,‌​‌ kernels work with structured​​ data and have well​​​‌ understood theoretical principles. Thus,‌ a goal of the‌​‌ Thoth project-team is to​​ design and optimize the​​​‌ training of such deep‌ kernel machines.
  • Large-scale parallel optimization. Deep kernel machines produce nonlinear representations of input data points. After encoding these data points, a learning task is often formulated as a large-scale convex optimization problem; for example, this is the case for linear support vector machines, logistic regression classifiers, or more generally many empirical risk minimization formulations. We intend to pursue recent efforts for making convex optimization techniques that are dedicated to machine learning more scalable. Most existing approaches address scalability issues either in model size (meaning that the function to minimize is defined on a domain of very high dimension), or in the amount of training data (typically, the objective is a large sum of elementary functions). There is thus substantial room for improvement for techniques that jointly take these two criteria into account.
  • Large-scale graphical​​​‌ models. To represent structured​ data, we will also​‌ investigate graphical models and​​ their optimization. The challenge​​​‌ here is two-fold: designing​ an adequate cost function​‌ and minimizing it. While​​ several cost functions are​​​‌ possible, their utility will​ be largely determined by​‌ the efficiency and the​​ effectiveness of the optimization​​​‌ algorithms for solving them.​ It is a combinatorial​‌ optimization problem involving billions​​ of variables and is​​​‌ NP-hard in general, requiring​ us to go beyond​‌ the classical approximate inference​​ techniques. The main challenges​​​‌ in minimizing cost functions​ stem from the large​‌ number of variables to​​ be inferred, the inherent​​​‌ structure of the graph​ induced by the interaction​‌ terms (e.g., pairwise terms),​​ and the high-arity terms​​​‌ which constrain multiple entities​ in a graph.
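The "large sum of elementary functions" setting mentioned in the large-scale optimization item can be illustrated with a small variance-reduced stochastic gradient sketch (SVRG) on l2-regularized logistic regression. This is a generic textbook-style example under our own assumptions, not the team's solver: the function names, the toy data, the step size, and the number of epochs are all hypothetical choices.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_i(w, x, y, lam):
    """Gradient of one term: log(1 + exp(-y * w.x)) + (lam / 2) * ||w||^2, y in {-1, +1}."""
    coeff = -y / (1.0 + math.exp(y * dot(w, x)))
    return [coeff * xj + lam * wj for xj, wj in zip(x, w)]


def full_grad(w, X, Y, lam):
    """Average of the per-example gradients (one full pass over the data)."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        g = [a + b / len(X) for a, b in zip(g, grad_i(w, x, y, lam))]
    return g


def objective(w, X, Y, lam):
    loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(X, Y)) / len(X)
    return loss + 0.5 * lam * dot(w, w)


def svrg(X, Y, lam=0.1, step=0.1, epochs=20, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        anchor = list(w)                     # snapshot of the current iterate
        mu = full_grad(anchor, X, Y, lam)    # full gradient at the snapshot
        for _ in range(len(X)):
            i = rng.randrange(len(X))
            g_w = grad_i(w, X[i], Y[i], lam)
            g_a = grad_i(anchor, X[i], Y[i], lam)
            # Variance-reduced direction: stochastic gradient corrected by the snapshot.
            w = [wj - step * (a - b + m) for wj, a, b, m in zip(w, g_w, g_a, mu)]
    return w


# Toy, linearly separable data (hypothetical).
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
Y = [1, 1, -1, -1]
w = svrg(X, Y)
print(objective([0.0, 0.0], X, Y, 0.1), objective(w, X, Y, 0.1))
```

Each inner step touches a single example, while the periodic full gradient keeps the variance of the update under control; scaling jointly in the number of examples and in the dimension of w is the harder question raised in the text.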

4​‌ Application domains

4.1 Visual​​ applications

Any solution to​​​‌ automatically understanding images and​ videos on a semantic​‌ level will have an​​ immediate impact on a​​​‌ wide range of applications.​ For example:

  • Semantic-level image​‌ and video access is​​ highly relevant for visual​​​‌ search on the Web,​ in professional archives and​‌ personal collections.
  • Visual data​​ organization is applicable to​​​‌ organizing family photo and​ video albums as well​‌ as to large-scale information​​ retrieval.
  • Visual object recognition​​​‌ has potential applications ranging​ from autonomous driving, to​‌ service robotics for assistance​​ in day-to-day activities as​​​‌ well as the medical​ domain.
  • Real-time scene understanding is relevant for human interaction through devices such as HoloLens or Oculus Rift.

4.2 Pluri-disciplinary research

Machine​‌ learning is intrinsically pluri-disciplinary.​​ By developing large-scale machine​​​‌ learning models and algorithms​ for processing data, the​‌ Thoth team became naturally​​ involved in pluri-disciplinary collaborations​​​‌ that go beyond visual​ modelling. During the last​‌ few years, Thoth has​​ conducted several collaborations in​​​‌ other fields such as​ neuroimaging, bioinformatics, ecology, natural​‌ language processing, and remote​​ sensing.

5 Social and​​​‌ environmental responsibility

5.1 Footprint​ of research activities

Compute

A significant amount of the team’s computations are performed on the Jean Zay national cluster. According to the cluster's reporting platform, 50k normalized GPU hours have been used by the team, which amounts to 1.2 tons CO2eq. Besides computations performed on this cluster, the team maintained its own cluster, on which part of the computations are done as well. Assuming 10 GPUs are used at all times (which is a rather generous estimate), this amounts to less than 100k GPU hours over the year. Most of these machines are hosted in the datacenter of the IMAG building, which is probably slightly less efficient than the GENCI infrastructure. Overall, we estimate our local consumption to be under 3 tons CO2eq.

In total, we estimate the emissions of the team's compute to be about 4 tons CO2eq. While we do not report the impact in terms of resources, the team made a special effort to keep local computing servers running for as long as possible, upgrading them when possible to avoid replacing them.
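The arithmetic behind these estimates can be reproduced with a short back-of-the-envelope script. It only reuses the figures quoted above (50k normalized GPU hours for 1.2 tons, 10 GPUs running all year, an upper bound of 3 tons for local compute); the variable names and the derived per-GPU-hour factors are our own illustration and are not reported by the team.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years

# Figures quoted in the text.
jean_zay_gpu_hours = 50_000   # normalized GPU hours on Jean Zay
jean_zay_tons = 1.2           # tons CO2eq reported for Jean Zay
local_gpus = 10               # GPUs assumed busy at all times
local_tons_upper = 3.0        # stated upper bound for the local cluster

# Derived quantities (illustrative, not stated in the report).
local_gpu_hours = local_gpus * HOURS_PER_YEAR
jean_zay_kg_per_gpu_hour = jean_zay_tons * 1000 / jean_zay_gpu_hours
local_kg_per_gpu_hour_upper = local_tons_upper * 1000 / local_gpu_hours
total_tons_upper = jean_zay_tons + local_tons_upper

print(local_gpu_hours)  # 87600, i.e. "less than 100k GPU hours"
print(round(jean_zay_kg_per_gpu_hour, 3), round(local_kg_per_gpu_hour_upper, 3))
print(round(total_tons_upper, 1))
```

The sum of the two figures is slightly above 4 tons, which is consistent with "about 4 tons" given that the local figure is an upper bound.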

This does not​​​‌ count the Dino (V2‌ and V3) models, which‌​‌ are significantly more expensive​​ to train but are​​​‌ also significantly more impactful‌ than an average research‌​‌ paper, being used in​​ more than 10K scientific​​​‌ projects, and for which‌ full emissions data is‌​‌ available.

Travel

The other main CO2eq footprint is international flights. While we did not gather specific numbers, team members take special care to reduce their air travel (several permanent researchers have not traveled by plane for several years), declining distant invitations and encouraging less travel-hungry community practices. This has led to a drastic reduction of our travel impact over the years, which we will try to quantify for the next activity report.

5.2 Impact‌ of research results

A large part of the Thoth team's research contributes to advancing the field of machine learning as a whole. This improves and promotes Artificial Intelligence tools, which have a large, still growing, and controversial societal impact (automation, recommendation algorithms, mass surveillance...). Besides these impacts, machine learning has a substantial (and also growing) environmental footprint, and is especially prone to the rebound effect, so that efficiency improvements alone do not reduce this impact.

Beyond methodological contributions, team​​​‌ members make more targeted‌ applied contributions that leverage‌​‌ Machine Learning for advancing​​ other sciences (e.g., astrophysics,​​​‌ earth science, physics simulations…).‌ Some of these projects‌​‌ focus on reducing carbon​​ footprints (e.g., by making​​​‌ electricity management more efficient),‌ or preserving biodiversity (e.g.,‌​‌ by better understanding ecosystem​​ responses to human pressure​​​‌ and global warming).

“Environmentally friendly” contributions do not offset the negative socio-environmental impacts of the current global AI race, which should be tackled at a larger scale. Hence, Thoth team members are involved at several levels (scientific policy, popularization of science, local socio-environmental initiatives) to support meaningful decision-making regarding these issues and future technological developments at a broader level.

6 Highlights​​ of the year

6.1​​​‌ Awards

  • Prix Jeunes Talents‌ L'Oréal-UNESCO 2025 for Bianca‌​‌ Marin Moreno
  • Karteek Alahari​​ received the Outstanding IJCV​​​‌ Editorial Board Member Award.‌
  • Julien Mairal received a‌​‌ top reviewer award at​​ NeurIPS 2025.
  • Michael Arbel​​​‌ received a top reviewer‌ award at AISTATS 2025.‌​‌
  • The spin-off Enhance Lab​​ received the i-Lab prize​​​‌ from BPI.
  • J. Chanussot received the 2025 IEEE GRSS Highest Impact Paper Award (HIPA), selected from the 18,670 papers published in the journals of the IEEE Geoscience and Remote Sensing Society in 2020-2024.
  • J. Chanussot was recognized as a Highly Cited Researcher (Clarivate Analytics).

7​​ Latest software developments, platforms,​​​‌ open data

7.1 Latest‌ software developments

7.1.1 Cyanure‌​‌

  • Name:
    Cyanure: An Open-Source​​ Toolbox for Empirical Risk​​​‌ Minimization
  • Functional Description:
    Cyanure is an open-source C++ software package with a Python interface. The goal of Cyanure is to provide state-of-the-art solvers for learning linear models, based on variance-reduced stochastic optimization with acceleration mechanisms and Quasi-Newton principles. Cyanure can handle a large variety of loss functions (logistic, square, squared hinge, multinomial logistic) and regularization functions (l2, l1, elastic-net, fused Lasso, multi-task group Lasso). It provides a simple Python API, very close to that of scikit-learn, which should be extended to other languages such as R or Matlab in the near future.
  • Release​ Contributions:
    packaging on conda and PyPI + various improvements
  • URL:
  • Contact:​​​‌
    Julien Mairal
  • Participant:
    2​ anonymous participants
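
To make the phrase "variance-reduced stochastic optimization" concrete, here is a minimal pure-Python sketch of an SVRG-style solver for l2-regularized logistic regression. It is an illustration only: it does not use or reproduce Cyanure's API or its actual solvers, and all names (`svrg`, `logistic_grad`, `objective`), the toy data, and the step size are assumptions made for this example.

```python
import math
import random


def logistic_grad(w, x, y, lam):
    """Gradient of log(1 + exp(-y <w, x>)) + (lam / 2) ||w||^2 for one sample."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Derivative of the logistic loss, written to avoid overflow in exp().
    if margin >= 0:
        e = math.exp(-margin)
        coeff = -y * e / (1.0 + e)
    else:
        coeff = -y / (1.0 + math.exp(margin))
    return [coeff * xi + lam * wi for xi, wi in zip(x, w)]


def full_grad(w, X, Y, lam):
    """Average gradient over the whole dataset."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        gi = logistic_grad(w, x, y, lam)
        g = [a + b / len(X) for a, b in zip(g, gi)]
    return g


def objective(w, X, Y, lam):
    """Regularized empirical risk."""
    loss = 0.0
    for x, y in zip(X, Y):
        m = y * sum(wi * xi for wi, xi in zip(w, x))
        loss += math.log1p(math.exp(-m)) if m > -30 else -m
    return loss / len(X) + 0.5 * lam * sum(wi * wi for wi in w)


def svrg(X, Y, lam=0.1, step=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        anchor = list(w)                    # snapshot of the iterate
        mu = full_grad(anchor, X, Y, lam)   # full gradient at the snapshot
        for _ in range(n):
            i = rng.randrange(n)
            g = logistic_grad(w, X[i], Y[i], lam)
            g_anchor = logistic_grad(anchor, X[i], Y[i], lam)
            # Variance-reduced direction: stochastic gradient corrected by the snapshot.
            w = [wj - step * (a - b + m) for wj, a, b, m in zip(w, g, g_anchor, mu)]
    return w


# Toy, hypothetical data: two positive and two negative points in the plane.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
Y = [1, 1, -1, -1]
w = svrg(X, Y)
```

The correction term `g - g_anchor + mu` keeps the update unbiased while shrinking its variance as the iterate approaches the snapshot. Acceleration and Quasi-Newton mechanisms are deliberately left out of this sketch.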

7.1.2 MLXP​‌

  • Name:
    Machine Learning eXperimentalist​​ for Python
  • Keywords:
    Reproducibility,​​​‌ Replication and consistency, Machine​ learning
  • Functional Description:
    MLXP is an open-source, simple, and lightweight experiment management tool based on Python. It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility. MLXP facilitates experiment launching, logging, and efficient result exploitation. Key components include automated job launching and hierarchical configuration files, logging of experiment outputs along with metadata, automated code and job version management, seamless multi-job submission to an HPC job scheduler, and intuitive result exploitation capabilities including querying results, grouping and aggregation operations.
  • URL:
  • Contact:
    Michael​​​‌ Arbel
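
The following sketch illustrates two of the ideas listed above, hierarchical configurations and grouping/aggregation of logged results, in plain Python. It is not MLXP code and does not use MLXP's API: `merge_config`, `group_mean`, the record layout, and the placeholder metric are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from statistics import mean


def merge_config(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def group_mean(runs, config_key, metric):
    """Group logged runs by a dotted config key and average one metric."""
    groups = defaultdict(list)
    for run in runs:
        node = run["config"]
        for part in config_key.split("."):
            node = node[part]
        groups[node].append(run["metrics"][metric])
    return {value: mean(scores) for value, scores in groups.items()}


base = {"optimizer": {"name": "sgd", "lr": 0.1}, "seed": 0}
runs = []
for lr in (0.1, 0.01):
    for seed in (0, 1):
        config = merge_config(base, {"optimizer": {"lr": lr}, "seed": seed})
        # Placeholder metric: a real run would train a model and log its loss.
        runs.append({"config": config, "metrics": {"loss": lr * (1 + seed)}})

# Average the logged loss over seeds, for each learning rate.
summary = group_mean(runs, "optimizer.lr", "loss")
```

Storing the full resolved configuration next to each run's metrics is what makes this kind of after-the-fact querying possible.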

8 New results​

8.1 Visual Recognition

Object-wise​‌ Distance Estimation for Event​​ Camera Data

Participants: Nan​​​‌ Cai, Pia Bideau​.

Event cameras provide a natural and data-efficient representation of visual information, motivating novel computational strategies for extracting visual information. Inspired by the biological vision system, in this work 26, we propose a behavior-driven approach for object-wise distance estimation from event camera data. This method mimics how biological systems, like the human eye, stabilize their view based on object distance: distant objects require minimal compensatory rotation to stay in focus, while nearby objects demand greater adjustments to maintain alignment. This adaptive strategy leverages natural stabilization behaviors to estimate relative distances effectively. Unlike traditional vision algorithms that estimate depth across the entire image, our approach targets local depth estimation within a specific region of interest. By aligning events within a small region, we estimate the angular velocity required to stabilize the image motion. We demonstrate that, under certain assumptions, the compensatory rotational flow is inversely proportional to the object's distance. The proposed approach achieves new state-of-the-art accuracy in distance estimation on the EVIMO2 dataset.
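
The inverse relation between compensatory rotation and distance can be illustrated with the classical motion-parallax identity. This is a sketch under simplifying assumptions that are not taken from the paper: a camera translating with velocity $v$, a static object at distance $d$, and an angle $\theta$ between the translation direction and the line of sight. The symbols $\omega$, $v$, $d$ and $\theta$ are introduced here for illustration only.

```latex
% Angular velocity needed to keep a static object fixated while translating:
\omega = \frac{\lVert v \rVert \, \sin\theta}{d}
% For two objects seen under the same angle \theta and the same camera motion,
% the ratio of distances is the inverse ratio of compensatory rotations:
\qquad\Longrightarrow\qquad
\frac{d_1}{d_2} = \frac{\omega_2}{\omega_1}
```

Under these assumptions, measuring the stabilizing angular velocity for each region gives relative distances without estimating depth for the whole image.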

Figure 1: Distance estimation from event data.
Salience-SGG: Enhancing​ Unbiased Scene Graph Generation​‌ with Iterative Salience Estimation​​

Participants: Runfeng Qu,​​​‌ Ole Hall, Pia​ Bideau, Julie Ouerfelli-Ethier​‌, Martin Rolfs,​​ Klaus Obermayer, Olaf​​​‌ Hellwich.

Scene Graph​ Generation (SGG) suffers from​‌ a long-tailed distribution, where​​ a few predicate classes​​​‌ dominate while many others​ are underrepresented, leading to​‌ biased models that underperform​​ on rare relations. Unbiased-SGG​​​‌ methods address this by​ implementing debiasing strategies, but​‌ often at the cost​​ of spatial understanding—resulting in​​​‌ over-reliance on semantic priors.​ In 37, we​‌ introduce Salience-SGG, a novel​​ framework featuring an Iterative​​​‌ Salience Decoder (ISD) that​ emphasizes triplets with salient​‌ spatial structures. To support​​ this, we propose semantic-agnostic​​ salience labels guiding ISD.​​​‌ Evaluations on Visual Genome,‌ Open Images V6, and‌​‌ GQA-200 show that Salience-SGG​​ achieves state-of-the-art performance and​​​‌ improves existing Unbiased-SGG methods‌ in their spatial understanding‌​‌ as demonstrated by the​​ Pairwise Localization Average Precision.​​​‌

Code is available on GitHub.

Figure 2: Salience-SGG. (a) Methods based on standard debiasing over-rely on semantic information, i.e., coat and hanging from (dashed lines). (b) Our salience-enhanced model favors spatially coherent triplets (bold).
Watching Swarm Dynamics from​​ Above: A Framework for​​​‌ Advanced Object Tracking in‌ Drone Videos

Participants: Pia‌​‌ Bideau, Duc Pham​​, Félicie Dhellemmes,​​​‌ Matthew Hansen, Jens‌ Krause.

Easily accessible technologies, such as drones equipped with diverse onboard sensors, have greatly expanded opportunities to study animal behavior in natural environments. However, analyzing large volumes of unlabeled video data, often spanning hours, remains a significant challenge for machine learning, particularly in computer vision. Existing approaches typically process only a small number of frames, and accurate georeferencing of tracked positions is still largely unresolved, particularly in dynamic environments where static landmarks cannot be established. In this work, we focus on long-term tracking of animal behavior in real-world geographic coordinates. To address this challenge, we use classical probabilistic methods for state estimation, such as particle filtering. Particle filters offer a useful algorithmic structure for recursively incorporating new incoming information, thus ensuring temporal consistency. By incorporating recent developments in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in image coordinates, but also tracks the position and spatial extent of the fish school in geographic coordinates by fusing video data with the drone's onboard sensor information (GPS and IMU). No landmarks with known geographic coordinates are required, making the proposed method adaptable to unstructured, dynamic environments like the open ocean, where static landmarks are unavailable. With this, the presented framework enables researchers to study the collective behavior of fish schools within their social and environmental context.
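
The recursive predict/update/resample structure of a particle filter can be sketched in a few lines. The example below is a generic one-dimensional bootstrap particle filter, not the tracker described above: the random-walk motion model, the Gaussian observation model, and every parameter value are assumptions chosen for illustration.

```python
import math
import random


def particle_filter(observations, n_particles=500, motion_std=0.5,
                    obs_std=1.0, seed=0):
    """Return the filtered position estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(observations[0], obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: propagate each particle with a random-walk motion model.
        particles = [p + rng.gauss(0.0, motion_std) for p in particles]
        # Update: weight particles by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:  # degenerate case: fall back to uniform weights
            weights, total = [1.0] * n_particles, float(n_particles)
        weights = [w / total for w in weights]
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw a new particle set proportionally to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Synthetic, hypothetical trajectory: slow drift observed with unit noise.
rng = random.Random(1)
truth = [0.3 * t for t in range(40)]
observations = [x + rng.gauss(0.0, 1.0) for x in truth]
estimates = particle_filter(observations)
mean_abs_error = sum(abs(e - x) for e, x in zip(estimates, truth)) / len(truth)
```

Each new observation only reweights and resamples the current particle set, which is what allows such a filter to run over long videos while keeping estimates consistent over time.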

Code and‌ the newly introduced dataset‌​‌ for tracking collective animal​​ behavior over long time​​​‌ horizons in marine environments‌ are available here.‌​‌

Figure 3: An illustration of tracking animal swarms in drone videos using particle filters and deep learning.
LUDVIG:​​ Learning-free Uplifting of 2D​​​‌ Visual features to Gaussian‌ Splatting scenes.

Participants: Juliette‌​‌ Marrie, Romain Menegaux​​, Michael Arbel,​​​‌ Diane Larlus, Julien‌ Mairal.

In 34‌​‌, we address the​​ problem of extending the​​​‌ capabilities of vision foundation‌ models such as DINO,‌​‌ SAM, and CLIP, to​​ 3D tasks. Specifically, we​​​‌ introduce a novel method‌ to uplift 2D image‌​‌ features into 3D Gaussian​​ Splatting scenes. Unlike traditional​​​‌ approaches that rely on‌ minimizing a reconstruction loss,‌​‌ our method employs a​​​‌ simpler and more efficient​ feature aggregation technique, augmented​‌ by a graph diffusion​​ mechanism. Graph diffusion enriches​​​‌ features from a given​ model, such as CLIP,​‌ by leveraging 3D geometry​​ and pairwise similarities induced​​​‌ by another strong model​ such as DINOv2. Our​‌ approach achieves performance comparable​​ to the state of​​​‌ the art on multiple​ downstream tasks while delivering​‌ significant speed-ups. Notably, we​​ obtain competitive segmentation results​​​‌ using generic DINOv2 features,​ despite DINOv2 not being​‌ trained on millions of​​ annotated segmentation masks like​​​‌ SAM. When applied to​ CLIP features, our method​‌ demonstrates strong performance in​​ open-vocabulary object detection tasks,​​​‌ highlighting the versatility of​ our approach.
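
As a rough illustration of what aggregation without a reconstruction loss can look like, the sketch below averages 2D features into per-Gaussian 3D features, weighting each pixel feature by how much the Gaussian contributed to that pixel. This is an assumed reading of "feature aggregation", not the LUDVIG implementation: `uplift_features`, the triple format, and the weights are hypothetical, and the graph diffusion step is omitted.

```python
def uplift_features(contributions, n_gaussians, dim):
    """Weighted average of 2D features per Gaussian.

    `contributions` holds (gaussian_index, weight, feature_vector) triples,
    one per (view, pixel, Gaussian) pair; the weight is assumed to be the
    Gaussian's rendering contribution to that pixel.
    """
    sums = [[0.0] * dim for _ in range(n_gaussians)]
    totals = [0.0] * n_gaussians
    for idx, weight, feature in contributions:
        totals[idx] += weight
        for k in range(dim):
            sums[idx][k] += weight * feature[k]
    # Gaussians that were never observed keep a zero feature.
    return [
        [s / totals[i] if totals[i] > 0 else 0.0 for s in sums[i]]
        for i in range(n_gaussians)
    ]


contributions = [
    (0, 0.75, [1.0, 0.0]),
    (0, 0.25, [0.0, 1.0]),
    (1, 0.5, [2.0, 2.0]),
]
features_3d = uplift_features(contributions, n_gaussians=3, dim=2)
```

A single pass over the rendering contributions is enough here, in contrast with an iterative optimization of a reconstruction loss.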

Figure 4: Illustration of the LUDVIG approach.
Cluster​​​‌ and Predict Latent Patches​ for Improved Masked Image​‌ Modeling

Participants: Maxime Oquab​​, Federico Baldassarre,​​​‌ Timothee Darcet, Julien​ Mairal, Piotr Bojanowski​‌.

Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper 8, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. The approach is illustrated in Figure 5.

Figure 5: Illustration of the CAPI approach.
Entropy Rectifying Guidance​​​‌ for Diffusion and Flow​ Models

Participants: Tariq Berrada​‌ Ifriqi, Adriana Romero-Soriano​​, Michal Drozdzal,​​​‌ Jakob Verbeek, Karteek​ Alahari.

Guidance techniques are commonly used in diffusion and flow models to improve image quality and input consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) is the most widely adopted guidance technique. It results, however, in trade-offs across quality, diversity and consistency: improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with the overhead of requiring an additional (weaker) model or more forward passes per sampling step. In this work 29, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance method based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements in image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. We show that ERG results in significant improvements in various tasks, including text-to-image, class-conditional and unconditional image generation (see examples in Figure 6). We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further improving generation results.
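
For reference, the standard classifier-free guidance update that ERG is compared against can be written as follows. This is the usual textbook form of CFG, not the ERG rule itself; the notation ($\epsilon_\theta$ for the denoiser, $c$ for the condition, $\varnothing$ for the null condition, $w$ for the guidance scale) is chosen here for illustration.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr),
\qquad w \ge 1 .
```

Setting $w = 1$ recovers the plain conditional prediction, while larger $w$ pushes samples toward the condition. Because the update needs both a conditional and an unconditional prediction, it has no direct counterpart for unconditional sampling, which is the case an attention-based method such as ERG can also cover.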

Figure 6: Qualitative comparison of classifier-free guidance (CFG) and our Entropy Rectifying Guidance (ERG).
Boosting Latent Diffusion with​​ Perceptual Objectives

Participants: Tariq​​​‌ Berrada Ifriqi, Pietro‌ Astolfi, Melissa Hall‌​‌, Marton Havasi,​​ Yohann Benchetrit, Adriana​​​‌ Romero-Soriano, Karteek Alahari‌, Michal Drozdzal,‌​‌ Jakob Verbeek.

Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL) 23. This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results, with boosts between 6% and 20% in FID, and improved qualitative results when using our perceptual loss (see examples in Figure 7).

Figure 7: Samples from models trained with and without our latent perceptual loss on CC12M.
Lightweight Structure-Aware‌​‌ Attention for Visual Understanding​​

Participants: Heeseung Kwon,​​​‌ Francisco M. Castro,‌ Manuel J. Marin-Jimenez,‌​‌ Nicolas Guil, Karteek​​ Alahari.

The attention operator has been widely used as a basic building block in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) the complexity in computation and memory is quadratic in the sequence length. In this work 13, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has better representation power with log-linear complexity (see Figure 8). Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and other downstream tasks such as video action recognition on Kinetics-400, object detection and instance segmentation on COCO, and semantic segmentation on ADE-20K.

Figure 8: Self-attention vs. LiSA. (a) Process of self-attention & LiSA: LiSA updates the attention to the structure-aware attention via RPEs. (b) Feature visualization of self-attention & LiSA: compared to self-attention, LiSA learns better features by capturing geometric structural patterns.
Source-free video domain adaptation​​​‌ by learning from noisy​ labels

Participants: Avijit Dasgupta​‌, C. V. Jawahar​​, Karteek Alahari.​​​‌

Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper 9, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher–student (TS) framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available.
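
The two mechanisms described above, small-loss sample selection and a gradually updated teacher, can be sketched as follows. This is a simplified illustration rather than the CleanAdapt code: the function names, the fixed `keep_ratio`, and the exponential-moving-average form of the teacher update are assumptions made for this example.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against a hard label."""
    return -math.log(max(probs[label], 1e-12))


def select_small_loss(probs, pseudo_labels, keep_ratio=0.5):
    """Indices of the samples whose pseudo-labels incur the smallest loss."""
    losses = [cross_entropy(p, y) for p, y in zip(probs, pseudo_labels)]
    n_keep = max(1, int(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=losses.__getitem__)
    return sorted(order[:n_keep])


def ema_update(teacher, student, momentum=0.99):
    """Move teacher weights slowly toward the student (assumed EMA rule)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical model outputs on four target videos and their pseudo-labels.
probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45]]
pseudo_labels = [0, 0, 1, 1]
clean_indices = select_small_loss(probs, pseudo_labels, keep_ratio=0.5)
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

Here samples 0 and 2 agree with their pseudo-labels and have the smallest loss, so they are kept for fine-tuning, while samples 1 and 3 are treated as likely label noise.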

Figure 9: Existing approaches have a source-dependent adaptation stage achieving marginal performance gain over the source-pretrained models. On the other hand, our proposed methods CleanAdapt and CleanAdapt + TS achieve significant performance improvements over the source-only model while being source-free (i.e., the adaptation stage does not require videos from the source domain).
Flowception:​​ Temporally Expansive Flow Matching​​​‌ for Video Generation

Participants:​ Tariq Berrada Ifriqi,​‌ John Nguyen, Karteek​​ Alahari, Jakob Verbeek​​​‌, Ricky T. Q.​ Chen.

We present Flowception 46, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context (see examples in Figure 10). Compared to full-sequence flows, our method reduces training FLOPs three-fold, while also being more amenable to local attention variants and allowing the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.

Figure 10: Examples of image-to-video (I2V) generation and video interpolation with Flowception. Input frames marked by dashed boundaries.
Online In-Context Distillation for​​​‌ Low-Resource Vision Language Models‌

Participants: Zhiqi Kang,‌​‌ Rahaf Aljundi, Vaggelis​​ Dorovatas, Karteek Alahari​​​‌.

As the field‌ continues its push for‌​‌ ever more resources, this​​ work turns the spotlight​​​‌ on a critical question:‌ how can vision-language models‌​‌ (VLMs) be adapted to​​ thrive in low-resource, budget-constrained​​​‌ settings? While large VLMs‌ offer strong performance, they‌​‌ are impractical to deploy​​ in such settings. Small​​​‌ VLMs, on the other‌ hand, are efficient but‌​‌ typically require costly fine-tuning​​ to close the performance​​​‌ gap with larger models‌ in the deployment domain.‌​‌ Inspired by the in-context​​ learning framework, we propose​​​‌ an online In-Context Distillation‌ (ICD) method 48,‌​‌ in which a small​​ VLM collaborates with a​​​‌ stronger teacher model at‌ inference time, distilling its‌​‌ knowledge via sparse demonstrations​​ to efficiently bridge the​​​‌ gap between them (see‌ overview in Figure 11‌​‌). Our method is​​ built on an in-depth​​​‌ analysis that identifies the‌ scale and the choice‌​‌ of models for which​​ vision-language ICL is currently​​​‌ feasible, and demonstrates the‌ advantage of ICL over‌​‌ fine-tuning under constrained compute​​ budgets. We enhance our​​​‌ method with a novel‌ cross-modal demonstration selection strategy,‌​‌ teacher test-time scaling to​​ reduce noise, and student​​​‌ uncertainty conditioning to dynamically‌ populate a demonstration pool‌​‌ and minimize teacher queries.​​ Our ICD method significantly​​​‌ boosts the performance of‌ small models (up to‌​‌ 33%) using​​ scarce teacher annotations (as​​​‌ low as 4%‌), and competes with‌​‌ the teacher's zero-shot performance.​​

Figure 11: Overview of our online In-Context Distillation framework.

8.2‌​‌ Statistical Machine Learning and​​ Optimization

Counterfactual Learning of​​​‌ Stochastic Policies with Continuous‌ Actions

Participants: Houssam Zenati‌​‌, Pierre Gaillard,​​ Julien Mairal.

Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In 20, we address the problem of counterfactual learning of stochastic policies with continuous actions, which raises difficult challenges about (i) data modeling, (ii) optimization, and (iii) evaluation on real data. First, we introduce a modeling strategy based on a joint kernel embedding of contexts and actions, illustrated in Figure 12, which overcomes the shortcomings of previous discretization strategies as shown in 9. Second, we empirically show that the optimization aspect of counterfactual learning is more important than previously thought, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
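
To make the counterfactual setting concrete, the sketch below evaluates a Gaussian policy over continuous actions from logged data with a clipped importance-sampling estimator. This is a standard estimator used here for illustration and not necessarily the one studied in the paper: the log format, the Gaussian policy family, and the clipping constant are assumptions.

```python
import math


def gaussian_pdf(a, mean, std):
    """Density of a normal distribution at action `a`."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def clipped_ips(logs, policy_mean, policy_std, clip=10.0):
    """Estimate the expected cost of a new policy from logged interactions.

    `logs` holds (context, action, logging_density, cost) tuples, where
    `logging_density` is the density of the logged action under the policy
    that collected the data.
    """
    total = 0.0
    for x, a, p0, cost in logs:
        weight = gaussian_pdf(a, policy_mean(x), policy_std) / p0
        total += cost * min(weight, clip)  # clipping limits the variance
    return total / len(logs)


# Hypothetical logs collected by a policy a ~ N(x, 1), with cost (a - x)^2.
logs = []
for x, a in [(0.0, 0.5), (1.0, 0.2), (2.0, 2.4)]:
    logs.append((x, a, gaussian_pdf(a, x, 1.0), (a - x) ** 2))

# Sanity check: evaluating the logging policy itself gives unit weights.
on_policy_value = clipped_ips(logs, policy_mean=lambda x: x, policy_std=1.0)
```

With continuous actions the weights are ratios of densities rather than of probabilities, which is one reason why modeling and optimization choices matter more than in the discrete case.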

Figure 12: Illustration of the counterfactual modeling approach.
MAP​​​‌ Estimation with Denoisers: Convergence‌ Rates and Guarantees

Participants:‌​‌ Scott Pesme, Giacomo​​ Meanti, Michael Arbel​​​‌, Julien Mairal.‌

Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimization problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate, despite the lack of general theoretical justification for this substitution. In 36, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior p. We show that this algorithm can be interpreted as gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods. This result is summarized in Figure 13.
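
The two objects involved in this substitution can be written down explicitly. The formulas below are the standard definitions of the proximal operator and of Tweedie's identity, stated with notation chosen here ($\tau$ for the proximal step, $\sigma$ for the noise level, $D_\sigma$ for the ideal denoiser); they are background material, not the algorithm or the theorem of the paper.

```latex
% Proximal operator of the negative log-prior (the MAP denoising problem):
\operatorname{prox}_{-\tau \log p}(y)
  = \arg\min_{x} \; \Bigl\{ \tfrac{1}{2} \lVert x - y \rVert^2 - \tau \log p(x) \Bigr\}

% Ideal (MMSE) denoiser at noise level \sigma and Tweedie's identity,
% with p_\sigma = p * \mathcal{N}(0, \sigma^2 I) the smoothed prior:
D_\sigma(y) = \mathbb{E}[\, x \mid y \,] = y + \sigma^2 \, \nabla \log p_\sigma(y)
```

The first is a MAP estimate and the second a posterior mean, so they differ in general. This is why replacing the proximal operator by a denoiser requires a justification of the kind given above.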

Figure 13: Our main result.
Logarithmic Regret for​‌ Unconstrained Submodular Maximization Stochastic​​ Bandit

Participants: Julien Zhou​​​‌, Pierre Gaillard,​ Thibaud Rahier, Julyan​‌ Arbel.

In 40, we address the online unconstrained submodular maximization problem (Online USM), in a setting with stochastic bandit feedback. In this framework, a decision-maker receives noisy rewards from a nonmonotone submodular function taking values in a known bounded interval. This paper proposes Double-Greedy - Explore-then-Commit (DG-ETC), adapting the Double-Greedy approach from the offline and online full-information settings. DG-ETC satisfies an $O(d \log(dT))$ problem-dependent upper bound for the $1/2$-approximate pseudo-regret, as well as an $O(d\, T^{2/3} \log(dT)^{1/3})$ problem-free one at the same time, outperforming existing approaches. To that end, we introduce a notion of hardness for submodular functions, characterizing how difficult it is to maximize them with this type of strategy.
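
The offline building block mentioned above can be sketched as follows. This is the classical randomized Double-Greedy procedure with exact function evaluations, shown on a toy graph-cut function; it is not DG-ETC, which has to work from noisy bandit feedback. The function names and the example graph are chosen for illustration.

```python
import random


def double_greedy(f, ground_set, rng=None):
    """Randomized double greedy for unconstrained submodular maximization."""
    rng = rng or random.Random(0)
    lower, upper = set(), set(ground_set)
    for item in ground_set:
        # Marginal gain of adding the item to the growing set...
        gain_add = max(f(lower | {item}) - f(lower), 0.0)
        # ...versus the gain of removing it from the shrinking set.
        gain_remove = max(f(upper - {item}) - f(upper), 0.0)
        if gain_add + gain_remove == 0.0:
            p_add = 1.0
        else:
            p_add = gain_add / (gain_add + gain_remove)
        if rng.random() < p_add:
            lower.add(item)
        else:
            upper.discard(item)
    return lower  # lower and upper coincide once every item is decided


# Toy nonmonotone submodular function: cut value on a 4-cycle.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]


def cut_value(subset):
    return float(sum((u in subset) != (v in subset) for u, v in EDGES))


solution = double_greedy(cut_value, [0, 1, 2, 3])
```

On this small example the two marginal gains are either tied (first item) or one of them is zero (later items), so the procedure ends on a maximum cut whatever the random draw. The difficulty in the bandit setting is that these gains are only observed through noise, and how close they are to each other plausibly governs how much exploration is needed.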

Figure 14: Illustration of our new notion of hardness for submodular bandits. Logarithmic regret can be achieved as soon as the problem parameters α and β are different.
Locally​ Adaptive Online Nonparametric Regression​‌

Participants: Paul Liautaud,​​ Pierre Gaillard, Olivier​​​‌ Wintenberger.

In 32 and 31, we study online adversarial regression with convex losses against a rich class of continuous yet highly irregular prediction rules, modeled by Besov spaces $B^{s}_{p,q}$ with general parameters $1 \le p, q \le \infty$ and smoothness $s > d/p$. We introduce an adaptive wavelet-based algorithm that performs sequential prediction without prior knowledge of $(s, p, q)$, and establish minimax-optimal regret bounds against any comparator in $B^{s}_{p,q}$. We further design a locally adaptive extension capable of dynamically tracking spatially inhomogeneous smoothness. This adaptive mechanism adjusts the resolution of the predictions over both time and space, yielding refined regret bounds in terms of local regularity. Consequently, in heterogeneous environments, our adaptive guarantees can significantly surpass those obtained by standard global methods.

Figure 15: Theoretical and practical regrets achieved by our two procedures on simulated data.
Online Learning​​ Approach for Survival Analysis​​​‌

Participants: Camila Fernandez,‌ Pierre Gaillard, Olivier‌​‌ Wintenberger.

In 10, we introduce an online mathematical framework for survival analysis, allowing real-time adaptation to dynamic environments and censored data. This framework enables the estimation of event time distributions through an optimal second-order online convex optimization algorithm, Online Newton Step (ONS). This approach, previously unexplored, presents substantial advantages, including explicit algorithms with non-asymptotic convergence guarantees. Moreover, we analyze the selection of ONS hyperparameters, which depends on the exp-concavity property and has a significant influence on the regret bound. We introduce an adaptive aggregation method that ensures robustness in hyperparameter selection while maintaining fast regret bounds. These findings can extend beyond the survival analysis field, and are relevant for any case characterized by poor exp-concavity and unstable ONS. Additionally, we propose a stochastic approach for ONS that guarantees logarithmic regret in the case of an exponential hazard model. These assertions are illustrated by simulation experiments, followed by an application to a real dataset. Fernandez et al. 55 also provide an experimental comparison of existing algorithms for survival analysis.
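
As an illustration of how an Online Newton Step update handles censored data, the sketch below estimates a constant hazard rate from a stream of right-censored durations. It is a scalar toy version under assumptions made for this example (a constant hazard, a per-sample negative log-likelihood `rate * duration - observed * log(rate)`, fixed `gamma` and clipping bounds) and not the algorithm analyzed in the paper.

```python
import random


def ons_exponential_hazard(samples, gamma=1.0, eps=1.0, rate0=1.0,
                           bounds=(0.2, 10.0)):
    """Scalar Online Newton Step on the censored exponential likelihood.

    `samples` holds (duration, observed) pairs, where `observed` is False
    when the event time was right-censored at `duration`.
    """
    rate, curvature = rate0, eps
    for duration, observed in samples:
        # Gradient of rate * duration - observed * log(rate).
        grad = duration - (1.0 if observed else 0.0) / rate
        curvature += grad * grad            # running second-order statistic
        rate -= grad / (gamma * curvature)  # Newton-like step
        rate = min(max(rate, bounds[0]), bounds[1])  # projection on an interval
    return rate


# Simulated stream: true rate 2.0, observations censored at time 1.0.
rng = random.Random(0)
true_rate, censor_time = 2.0, 1.0
samples = []
for _ in range(5000):
    t = rng.expovariate(true_rate)
    samples.append((min(t, censor_time), t <= censor_time))
estimate = ons_exponential_hazard(samples)
```

In one dimension the projection in the ONS norm reduces to clipping. The parameter `gamma` plays the role of the exp-concavity-dependent hyperparameter mentioned above, which is why its choice affects the stability of the updates.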

Figure 16: Estimation errors of our algorithms on simulated survival data.
Efficient and Near-Optimal‌ Online Portfolio Selection

Participants:‌​‌ Rémi Jézéquel, Dmitrii​​ Ostrovski, Pierre Gaillard​​​‌.

In 12, we study online portfolio selection as introduced by Cover (1991), where a trader allocates wealth over $d$ assets across $T$ rounds to maximize logarithmic return. Cover's Universal Portfolios achieve worst-case optimal $O(d \log T)$ regret but require costly $d$-dimensional integration, leading to a prohibitive $\tilde{O}(d^4 (T+d)^{14})$ per-round runtime. We propose a new algorithm achieving essentially the same regret (up to constants and replacing $\log T$ with $\log(T+d)$) with a drastically improved runtime of $\tilde{O}(d^2 (T+d))$ per round. Our method selects portfolios by minimizing logarithmic loss regularized by a log-determinant barrier, revealing connections between online portfolio selection and classical cutting-plane and interior-point methods.

Online‌​‌ Convex Reinforcement Learning with​​ applications to Demand-Side Management.​​​‌

Participants: Bianca Marin Moreno‌, Khaled Eldowa,‌​‌ Margaux Brégère, Pierre​​ Gaillard, Nadia Oudjane​​​‌.

To counter the challenge of integrating fluctuating renewables into the grid, devices like thermostatically controlled loads (water heaters, air conditioners, etc.) offer flexible demand. However, efficiently controlling a large population of these devices to track desired consumption signals remains a complex challenge. Existing methods lack convergence guarantees and computational efficiency, or resort to regularization techniques instead of tackling the target tracking problem directly. The work in 14 addresses these drawbacks. We propose to model the problem as a finite-horizon episodic Markov decision process, enabling us to adapt convex optimization algorithms with convergence guarantees and computational efficiency. This framework also extends to online learning scenarios, where daily control decisions are made without prior knowledge of consumer behavior and with daily-changing target profiles due to fluctuations of energy production and inflexible consumption. We introduce a new algorithm, called Online Target Tracker (OTT), the first online learning load control method, for which we prove sub-linear regret. We demonstrate our claims with realistic experiments. This combination of optimization and learning lays the groundwork for more dynamic and efficient load control methods. The work in 33 studies a generalization of episodic Reinforcement Learning to convex losses that could be applied to Demand-Side Management in an unknown environment. By introducing a reset-free framework called the periodic framework, the work in 49 weakens the episodic assumption to avoid having to reset the population of devices to the initial distribution at every episode.

Figure 17: Demand Side Management Problem.
Optimized projection-free algorithms for​ online learning: construction and​‌ worst-case analysis

Participants: Julien​​ Weibel, Pierre Gaillard​​​‌, Wouter Koolen,​ Adrien Taylor.

In 53, we study projection-free algorithms for online learning with linear optimization oracles (Frank–Wolfe methods) to handle constrained decision sets. We propose an optimized variant of an online Frank–Wolfe algorithm with a simple potential-based analysis, and introduce a semidefinite programming framework to jointly design and analyze such algorithms. Our numerical results suggest that no pure online Frank–Wolfe method in this model class can achieve regret better than $O(T^{3/4})$ without additional assumptions. We further observe suboptimal constants in existing methods, anytime guarantees of order $O(t^{3/4})$, and limited benefits from multiple linear optimization steps per round.
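
A basic online Frank–Wolfe template, the kind of method this study optimizes, can be sketched as follows on the probability simplex. It is a textbook-style illustration and not the optimized variant of the paper: the regularized surrogate, the step size `2 / sqrt(t)`, and the choice `eta = T ** -0.75` are illustrative assumptions.

```python
def lmo_simplex(gradient):
    """Linear optimization oracle on the simplex: the best single vertex."""
    best = min(range(len(gradient)), key=gradient.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(gradient))]


def online_frank_wolfe(loss_vectors, eta):
    """Play one point per round using a single oracle call and no projection."""
    d = len(loss_vectors[0])
    x1 = [1.0 / d] * d
    x = list(x1)
    grad_sum = [0.0] * d
    plays = []
    for t, g in enumerate(loss_vectors, start=1):
        plays.append(list(x))  # decision played before the loss is revealed
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        # Gradient at x of the surrogate eta * <grad_sum, x> + ||x - x1||^2.
        surrogate_grad = [eta * s + 2.0 * (xi - x0)
                          for s, xi, x0 in zip(grad_sum, x, x1)]
        vertex = lmo_simplex(surrogate_grad)
        sigma = min(1.0, 2.0 / t ** 0.5)
        # Convex combination keeps the iterate inside the simplex.
        x = [(1.0 - sigma) * xi + sigma * vi for xi, vi in zip(x, vertex)]
    return plays


# Linear losses that always favor the first coordinate.
T = 200
losses = [[0.0, 1.0, 1.0]] * T
plays = online_frank_wolfe(losses, eta=T ** -0.75)
```

Each round costs one call to the linear oracle instead of a projection, which is the appeal of such methods on complicated decision sets. The price is the slower regret rate discussed above.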

Figure 18: Comparison of known regret upper bounds against tight numerical bounds obtained from our analysis.
Optimal and Efficient​‌ Algorithms for Multinomial Logistic​​ Bandits

Participants: Pierre Boudart​​​‌, Pierre Gaillard,​ Alessandro Rudi, Aadirupa​‌ Saha.

In 38​​ and 44, we​​​‌ study active online assortment​ optimization with preference feedback,​‌ a framework for modeling​​ user choice and subsetwise​​​‌ utility maximization with applications​ in advertising, online retail,​‌ recommendation systems, and language​​ model fine-tuning. Existing approaches​​​‌ often rely on unrealistic​ assumptions such as strong​‌ reference items or repeated​​ identical assortments. In 38​​​‌, we design efficient​ regret-minimization algorithms that remove​‌ both of these assumptions.​​ In 44, we​​​‌ improve the asymptotic regret​ by a constant that​‌ may be exponentially large​​ in some cases.

Figure 19: Comparison of the error obtained when varying the number of feedback in MNL bandits.
Advancing Prompt-Based​ Methods for Replay-Independent General​‌ Continual Learning

Participants: Zhiqi​​ Kang, Liyuan Wang​​​‌, Xingxing Zhang,​ Karteek Alahari.

General continual learning (GCL) is a broad concept describing real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. Such requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstream GCL models trained from scratch (see illustration in Figure 20). While the use of a frozen pretrained backbone with appropriate prompt tuning can partially address these challenges, such prompt-based methods remain suboptimal for CL of the remaining tunable parameters on the fly. In this regard, we propose an approach named MISA (Mask and Initial Session Adaption) to advance prompt-based methods in GCL 30. It includes a forgetting-aware initial session adaption that employs pretraining data to initialize prompt parameters and improve generalizability, as well as a non-parametric logit mask of the output layers to mitigate catastrophic forgetting. Empirical results demonstrate substantial performance gains of our approach compared to recent competitors, especially without a replay buffer (e.g., leads of up to 18.39, 22.06, and 11.96 points on CIFAR-100, Tiny-ImageNet, and ImageNet-R, respectively). Moreover, our approach can be plugged into existing prompt-based methods, is independent of replay, is easy to implement, and avoids CL-relevant hyperparameters, serving as a strong baseline for GCL research. Our source code is publicly available.

Figure 20: Problem setup and motivation. Left: illustration of the GCL data stream. Middle: average prediction accuracy at different timesteps in GCL. Right: session 1 accuracy, where we evaluate the retention of knowledge acquired at session 1 after each session.
Unified Breakdown Analysis for​​ Byzantine Robust Gossip

Participants:​​​‌ Renaud Gaucher, Aymeric‌ Dieuleveut, Hadrien Hendrikx‌​‌.

Distributed approaches have many computational benefits, but they are vulnerable to attacks from a subset of devices transmitting incorrect information. This work 28 investigates Byzantine-resilient algorithms in a decentralized setting, where devices communicate directly with one another. We investigate the notion of breakdown point, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. This is done through careful study of a specific graph topology, presented in Figure 21. We introduce $\mathrm{CG}^+$, an algorithm at the intersection of ClippedGossip and NNA, two popular approaches for robust decentralized learning. $\mathrm{CG}^+$ meets our upper bound, and thus obtains optimal robustness guarantees, whereas neither of the two existing approaches does. We provide experimental evidence for this gap by presenting an attack tailored to sparse graphs which breaks NNA but against which $\mathrm{CG}^+$ is robust.
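As an illustration of why clipping limits the influence of adversaries, the following sketch implements one simplified clipped-gossip step on scalar values. It is written in the spirit of ClippedGossip and is not the algorithm proposed in 28: the uniform `weight`, the single threshold `tau`, and the function names are assumptions made for the example.

```python
def clip(value, tau):
    """Restrict a scalar difference to the interval [-tau, tau]."""
    return max(-tau, min(tau, value))


def clipped_gossip_step(values, neighbors, weight, tau):
    """One synchronous gossip round with clipped neighbor differences.

    values: dict node -> current scalar value.
    neighbors: dict node -> list of neighboring nodes.
    """
    new_values = {}
    for i, x_i in values.items():
        # Each neighbor can pull node i by at most weight * tau.
        update = sum(clip(values[j] - x_i, tau) for j in neighbors[i])
        new_values[i] = x_i + weight * update
    return new_values
```

In this toy version, a node with `b` Byzantine neighbors is displaced by at most `b * weight * tau` per round, whatever values those neighbors send.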

Figure 21: Graph construction used to derive the upper bound on the maximum number of Byzantine nodes that can be tolerated.
Byzantine-Robust Gossip:​​​‌ Insights from a Dual‌ Approach

Participants: Renaud Gaucher‌​‌, Aymeric Dieuleveut,​​ Hadrien Hendrikx.

Distributed​​​‌ learning has many computational‌ benefits but is vulnerable‌​‌ to attacks from a​​ subset of devices transmitting​​​‌ incorrect information. This paper‌ 45 investigates Byzantine-resilient algorithms‌​‌ in a decentralized setting,​​ where devices communicate directly​​​‌ in a peer-to-peer manner‌ within a communication network.‌​‌ We leverage the so-called​​ dual approach for decentralized​​​‌ optimization and propose a‌ Byzantine-robust algorithm. We provide‌​‌ convergence guarantees in the​​​‌ average consensus subcase, discuss​ the potential of the​‌ dual approach beyond this​​ subcase, and re-interpret existing​​​‌ algorithms using the dual​ framework, under the general​‌ update rule presented in​​ Figure 22. Lastly,​​​‌ we experimentally show the​ soundness of our method.​‌

Figure 22: Main update of the dual robust algorithm.
A​‌ Theoretical Framework for Grokking:​​ Interpolation followed by Riemannian​​​‌ Norm Minimisation

Participants: Etienne​ Boursier, Stott Pesme​‌, Radu-Alexandru Dragomir.​​

In 25, we study the dynamics of gradient flow with small weight decay on general training losses $F$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the grokking effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks. This mechanism is illustrated in Figure 23.
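The two-phase behaviour can be observed on a toy problem. The sketch below runs gradient descent with small weight decay on the two-parameter loss $F(w) = \tfrac{1}{2}(w_1 w_2 - 1)^2$, whose minimisers form the curve $w_1 w_2 = 1$. The loss, initial point, step size, and weight decay are our own illustrative choices and not an experiment from 25.

```python
def loss(w1, w2):
    """Toy training loss whose zero set is the curve w1 * w2 = 1."""
    return 0.5 * (w1 * w2 - 1.0) ** 2


def run(weight_decay=0.01, lr=0.05, steps=20000, w=(3.0, 0.2), record_at=(100,)):
    """Gradient descent on loss + (weight_decay / 2) * ||w||^2.

    Returns {step: (loss, squared l2 norm)} at the recorded steps and at the end.
    """
    w1, w2 = w
    snapshots = {}
    for step in range(1, steps + 1):
        residual = w1 * w2 - 1.0
        g1, g2 = residual * w2, residual * w1   # gradient of the toy loss
        w1, w2 = (
            w1 - lr * (g1 + weight_decay * w1),
            w2 - lr * (g2 + weight_decay * w2),
        )
        if step in record_at or step == steps:
            snapshots[step] = (loss(w1, w2), w1 * w1 + w2 * w2)
    return snapshots
```

With these settings the loss is already close to zero after 100 steps while the squared norm is still large. The norm then decays on a time scale of order $1/\lambda$ towards the minimum-norm point of the curve, near $(1, 1)$.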

Figure 23: Illustration of the grokking mechanism.
Flow Matching for​​ Robust Simulation-Based Inference under​​​‌ Model Misspecification

Participants: Pierre-Louis​ Ruhlmann, Pedro Rodrigues​‌, Michael Arbel,​​ Florence Forbes.

Simulation-based inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification: simulators are only approximations of reality, and mismatches between simulated and real data can yield biased or overconfident posteriors. In 51, we address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of real calibration samples, as illustrated in Figure 24. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by real observations, without requiring explicit knowledge of the misspecification. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty calibration compared to standard SBI baselines, while remaining computationally efficient.
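The second stage relies on a flow matching regression objective. As a minimal sketch, assuming the common linear interpolation path between a source sample `x0` and a target sample `x1`, the training target for a velocity model can be written as follows. The function names and the choice of path are assumptions for illustration and not the exact FMCPE objective.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along a linear path
    return x_t, velocity


def flow_matching_loss(velocity_model, x0, x1, t):
    """Mean squared error between predicted and target velocity at (x_t, t)."""
    x_t, target = flow_matching_pair(x0, x1, t)
    predicted = velocity_model(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)
```

In a correction setting of this kind, `x0` would play the role of a draw from the simulation-trained estimator and `x1` that of a calibration sample; `velocity_model` stands for any callable taking `(x_t, t)`.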

Figure 24: High-level description of the FMCPE algorithm.
Simulation-based inference​ of yeast centromeres

Participants:​‌ Eloïse Touron, Pedro​​ Rodrigues, Julyan Arbel​​, Nelle Varoquaux,​​​‌ Michael Arbel.

Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding can lead to malfunctions and, over time, diseases. In eukaryotes, centromeres are essential for proper chromosome segregation and folding. Despite extensive research using de novo sequencing of genomes and annotation analysis, centromere locations in yeasts remain difficult to infer and are still unknown in most species. Recently, genome-wide chromosome conformation capture coupled with next-generation sequencing (Hi-C) has become one of the leading methods to investigate chromosome structures. Some recent studies have used Hi-C data to give a point estimate of each centromere, but those approaches rely heavily on a good pre-localization. In 39, we present a novel approach that infers, in a stochastic manner, the locations of all centromeres in budding yeast based on both the experimental Hi-C map and simulated contact maps, using a neural network model as illustrated in Figure 25.

Figure 25: Architecture of the transformer-based model.
Dual Perspectives on​​​‌ Non-Contrastive Self-Supervised Learning

Participants:‌ Jean Ponce, Basile‌​‌ Terver, Martial Hebert​​, Michael Arbel.​​​‌

The stop gradient and‌ exponential moving average iterative‌​‌ procedures are commonly used​​ in non-contrastive approaches to​​​‌ self-supervised learning to avoid‌ representation collapse, with excellent‌​‌ performance in downstream applications​​ in practice. In 50​​​‌, we investigate these‌ procedures from the dual‌​‌ viewpoints of optimization and​​ dynamical systems. We show​​​‌ that, in general, although‌ they do not optimize‌​‌ the original objective, or​​ any other smooth function,​​​‌ they do avoid collapse.‌ Following prior work, but‌​‌ without any of the​​ extra assumptions used in​​​‌ their proofs, we then‌ show using a dynamical‌​‌ system perspective that, in​​ the linear case, minimizing​​​‌ the original objective function‌ without the use of‌​‌ a stop gradient or​​ exponential moving average always​​​‌ leads to collapse, as‌ shown in Figure 26‌​‌. Conversely, we characterize​​ explicitly the equilibria of​​​‌ the dynamical systems associated‌ with these two procedures‌​‌ in this linear setting​​ as algebraic varieties in​​​‌ their parameter space, and‌ show that they are,‌​‌ in general, asymptotically stable​​. Our theoretical findings​​​‌ are illustrated by empirical‌ experiments with real and‌​‌ synthetic data.

Figure 26: Illustration of the optimization landscape for the objective function used in non-contrastive self-supervised learning.
Learning Theory for Kernel​​​‌ Bilevel Optimization

Participants: Fares‌ El Khoury, Edouard‌​‌ Pauwels, Samuel Vaiter​​, Michael Arbel.​​​‌

Bilevel optimization has emerged as a technique for addressing a wide range of machine learning problems that involve an outer objective implicitly determined by the minimizer of an inner problem. In 27, we investigate the generalization properties of kernel bilevel optimization problems where the inner objective is optimized over a Reproducing Kernel Hilbert Space. This setting enables rich function approximation while providing a foundation for rigorous theoretical analysis. In this context, we establish novel generalization error bounds for the bilevel problem under finite-sample approximation. Our approach adopts a functional perspective, inspired by (Petrulionyte et al., 2024), and leverages tools from empirical process theory and maximal inequalities for degenerate $U$-processes to derive uniform error bounds. The results rely on an equivalence we establish between the estimator implemented in practice and an abstract one derived using the functional perspective that is more amenable to a statistical analysis, as shown in Figure 27. These generalization error estimates allow us to characterize the statistical accuracy of gradient-based methods applied to the empirical discretization of the bilevel problem.
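For reference, the generic shape of such a problem can be written as below. The symbols ($\omega$ for the outer variable, $h$ for the inner function, $\mathcal{H}$ for the RKHS, $L_{\mathrm{out}}$ and $L_{\mathrm{in}}$ for the two objectives) are our own notational assumptions and need not match those of 27.

```latex
\min_{\omega \in \Omega} \; \mathcal{F}(\omega) := L_{\mathrm{out}}\bigl(\omega, h^{\star}_{\omega}\bigr)
\quad \text{subject to} \quad
h^{\star}_{\omega} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \; L_{\mathrm{in}}(\omega, h)
```

In practice both objectives are expectations replaced by empirical averages over finite samples, and the generalization question is how far the resulting empirical problem and its gradients are from their population counterparts.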

Figure 27: A commutative diagram illustrating that plug-in statistical estimation and differentiation can be interchanged.
EquiTabPFN:​ A Target-Permutation Equivariant Prior​‌ Fitted Network

Participants: Michael​​ Arbel, David Salinas​​​‌, Frank Hutter.​

Recent foundational models for​‌ tabular data, such as​​ TabPFN, have demonstrated remarkable​​​‌ effectiveness in adapting to​ new tasks through in-context​‌ learning. However, these models​​ overlook a crucial equivariance​​​‌ property: the arbitrary ordering​ of target dimensions should​‌ not influence model predictions.​​ In 22, we​​​‌ identify this oversight as​ a source of incompressible​‌ error, termed the equivariance​​ gap, which introduces instability​​​‌ in predictions. To mitigate​ these issues, we propose​‌ a novel model designed​​ to preserve equivariance across​​​‌ output dimensions, as shown​ in Figure 28.​‌ Our experimental results indicate​​ that our proposed model​​​‌ not only addresses these​ pitfalls effectively but also​‌ achieves competitive benchmark performance.​​
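The equivariance property can be stated as a simple check: permuting the target dimensions of the context examples should permute the predictions in the same way. The sketch below verifies this for a toy kernel-weighted predictor, which is equivariant by construction. It is a stand-in for illustration only, not EquiTabPFN, and the function names and bandwidth are assumptions.

```python
import math


def kernel_predict(train_x, train_y, query_x, bandwidth=1.0):
    """Kernel-weighted average of multi-dimensional targets at query_x."""
    weights = [
        math.exp(-((query_x - x) ** 2) / (2.0 * bandwidth ** 2)) for x in train_x
    ]
    total = sum(weights)
    n_targets = len(train_y[0])
    return [
        sum(w * y[k] for w, y in zip(weights, train_y)) / total
        for k in range(n_targets)
    ]


def permute(vector, perm):
    """Reorder the entries of vector according to perm."""
    return [vector[p] for p in perm]
```

A model is target-permutation equivariant when predicting from permuted targets gives the same result as permuting the original prediction; the size of any mismatch is one way to quantify an equivariance gap.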

Figure 28: Overview of EquiTabPFN’s architecture.

8.3 Scientific Imaging and​‌ Remote Sensing

A New​​ Statistical Model of Star​​​‌ Speckles for Learning to​ Detect and Characterize Exoplanets​‌ in Direct Imaging Observations​​

Participants: Theo Bodrito,​​​‌ Olivier Flasseur, Julien​ Mairal, Jean Ponce​‌, Maud Langlois,​​ Anne-Marie Lagrange.

The​​​‌ search for exoplanets is​ an active field in​‌ astronomy, with direct imaging​​ as one of the​​​‌ most challenging methods due​ to faint exoplanet signals​‌ buried within stronger residual​​ starlight. Successful detection requires​​​‌ advanced image processing to​ separate the exoplanet signal​‌ from this nuisance component.​​ The paper 24 presents​​​‌ a novel statistical model​ that captures nuisance fluctuations​‌ using a multiscale approach,​​ leveraging problem symmetries and​​​‌ a joint spectral channel​ representation grounded in physical​‌ principles. Our model integrates​​ into an interpretable, end-to-end​​​‌ learnable framework for simultaneous​ exoplanet detection and flux​‌ estimation. The proposed algorithm​​ is evaluated against the​​​‌ state of the art​ using datasets from the​‌ SPHERE instrument operating at​​ the Very Large Telescope​​​‌ (VLT). It significantly improves​ the precision-recall tradeoff, notably​‌ on challenging datasets that​​ are otherwise unusable by​​​‌ astronomers. The proposed approach​ is computationally efficient, robust​‌ to varying data quality,​​ and well suited for​​​‌ large-scale observational surveys. The​ model is illustrated in​‌ Figure 29.

Figure 29: Illustration of the ExoMild model.
Unsupervised Imaging Inverse Problems​‌ with Diffusion Distribution Matching​​

Participants: Giacomo Meanti,​​​‌ Thomas Ryckeboer, Michael​ Arbel, Julien Mairal​‌.

This work 35 addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches, which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images, the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration, a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort. This approach is illustrated in Figure 30.

Figure 30: Illustration of our unsupervised learning approach for inverse problems.
Optimal transport‌ unlocks end-to-end learning for‌​‌ single-molecule localization

Participants: Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal.

Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores – fluorescent molecules stained onto the observed specimen – over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this work 52, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. This approach is illustrated in Figure 31.
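To illustrate what a set-matching loss based on optimal transport looks like, here is a small entropic (Sinkhorn) sketch between two point sets of equal total mass. It is a generic textbook construction, not the loss of 52: the uniform weights, squared Euclidean cost, `epsilon`, and iteration count are assumptions, and no log-domain stabilisation is included.

```python
import math


def sinkhorn_matching_cost(pred, target, epsilon=0.1, iterations=200):
    """Entropic optimal-transport cost between two point sets.

    pred, target: lists of coordinate lists. Returns (cost, transport plan).
    """
    n, m = len(pred), len(target)
    cost = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in target] for p in pred]
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternate scalings so that rows sum to 1/n and columns to 1/m.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan
```

Because the cost depends only on the two sets and not on the order of their elements, such a loss does not require pairing predictions with ground-truth emitters beforehand, and it is differentiable with respect to the predicted positions.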

Figure 31: Illustration of our SMLM approach.
SpectralEarth: Training‌​‌ Hyperspectral Foundation Models at​​ Scale

Participants: Nassim Ait​​​‌ Ali Braham, C.‌ Albrecht, Julien Mairal‌​‌, Jocelyn Chanussot,​​ Y Wang, Xiao​​​‌ Xiang Zhu.

Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, in 4 we introduce SpectralEarth, a large-scale multitemporal dataset designed to pretrain hyperspectral foundation models leveraging data from the environmental mapping and analysis program (EnMAP). SpectralEarth comprises 538 974 image patches covering 415 153 unique locations from 11 636 globally distributed EnMAP scenes spanning two years of archive. In addition, 17.5% of these locations include multiple timestamps, enabling multitemporal HSI analysis. Utilizing state-of-the-art self-supervised learning algorithms, we pretrain a series of foundation models on SpectralEarth, integrating a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct nine downstream datasets for land-cover, crop-type mapping, and tree-species classification, providing benchmarks for model evaluation. Experimental results support the versatility of our models and their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning. In Figure 32, we compare the size of various datasets published for Earth observation.

Figure 32: Comparison of dataset sizes for remote sensing.
MicroFlow:​‌ Domain-Specific Optical Flow for​​ Ground Deformation Estimation in​​​‌ Seismic Events

Participants: Juliette​ Bertrand, Sophie Giffard-Roisin​‌, James Hollingsworth,​​ Julien Mairal.

Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle to estimate small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. This work is available in the paper 43 and is illustrated in Figure 33.

Figure 33: Illustration of the MicroFlow approach.
Leveraging​​​‌ very high resolution optical​ remote sensing data and​‌ deep learning to assess​​ the potential for photovoltaïc​​​‌ energy production in urban​ areas

Participants: Alessia Boccalatte​‌, Jocelyn Chanussot.​​

Convolutional Neural Networks (CNNs) have shown remarkable success in remote sensing tasks. In urban contexts, recent research has utilized CNNs to generate rooftop segmentation masks and determine rooftop section orientation from aerial images. This cost-effective approach is especially valuable for large-scale rooftop solar potential estimations when detailed three-dimensional data is unavailable. This research, published in 3, introduces SolarMTNet, a novel multitask dense-prediction network designed for rooftop solar potential prediction using only aerial images. Unlike previous studies that focus on small manually labeled datasets (approximately 2000 scenes) and only segment rooftop orientations while typically assuming constant slopes, SolarMTNet simultaneously segments both orientations and slopes, enhancing the accuracy of solar potential estimations by 40%. SolarMTNet leverages a large, automatically labeled dataset (up to 280000 scenes) created from open-source Swiss geospatial and aerial data, significantly improving generalization. The model is trained on rooftop data from the Zurich and Geneva cantons and cross-validated on the Canton of Vaud, Switzerland. The results show a mean Intersection over Union (mIoU) of 0.67 for orientation segmentation and 0.40 for slope segmentation. The estimated irradiance exhibits an absolute mean percentage difference of only 5% compared to real solar cadaster data derived from detailed model-based calculations, primarily due to shading issues. Finally, SolarMTNet has also been tested in different geographical areas outside Switzerland (France and Germany), demonstrating consistent performance across diverse regions and pixel resolutions.
The​​ quantification of urban solar​​​‌ potential losses from rooftop‌ superstructures via aerial imagery‌​‌ and Convolutional Neural Networks​​ has also been considered​​​‌ 2.

Hyperspectral Pansharpening‌

Participants: Jocelyn Chanussot.‌​‌

Hyperspectral (HS) pansharpening consists​​ of fusing a high-resolution​​​‌ panchromatic (PAN) band and‌ a low-resolution HS image‌​‌ to obtain a new​​ image with high resolution​​​‌ in both the spatial‌ and spectral domains. These‌​‌ remote sensing products are​​ valuable for a wide​​​‌ range of applications, driving‌ ever-growing research efforts. Nonetheless,‌​‌ results still do not​​ meet application demands. In​​​‌ part, this comes from‌ the technical complexity of‌​‌ the task: compared to​​ multispectral (MS) pansharpening, many​​​‌ more bands are involved,‌ in a spectral range‌​‌ only partially covered by​​ the PAN component and​​​‌ with overwhelming noise. However,‌ another major limiting factor‌​‌ is the absence of​​ a comprehensive framework for​​​‌ the rapid development and‌ accurate evaluation of new‌​‌ methods. This article attempts​​ to address this issue.​​​‌ We started by designing‌ a dataset large and‌​‌ diverse enough to allow​​ reliable training (for data-driven​​​‌ methods) and testing of‌ new methods. Then, we‌​‌ selected a set of​​ state-of-the-art methods, following different​​​‌ approaches characterized by promising‌ performance, and reimplemented them‌​‌ in a single PyTorch​​ framework. Finally, we carried​​​‌ out a critical comparative‌ analysis of all methods,‌​‌ using the most accredited​​ quality indicators. The analysis​​​‌ highlights the main limitations‌ of current solutions in‌​‌ terms of spectral/spatial quality​​ and computational efficiency, and​​​‌ it suggests promising research‌ directions 7.

On‌​‌ a related topic, another​​ work presents a critical​​​‌ survey of deep learning‌ in remote sensing image‌​‌ fusion 16.

Probing​​ Synergistic High-Order Interaction for​​​‌ Multi-Modal Image Fusion

Participants:‌ Jocelyn Chanussot.

Multi-modal‌​‌ image fusion aims to​​ generate a fused image​​​‌ by integrating and distinguishing‌ the cross-modality complementary information‌​‌ from multiple source images.​​ While the cross-attention mechanism​​​‌ with global spatial interactions‌ appears promising, it only‌​‌ captures second-order spatial interactions,​​ neglecting higher-order interactions in​​​‌ both spatial and channel‌ dimensions. This limitation hampers‌​‌ the exploitation of synergies​​ between multi-modalities. To bridge​​​‌ this gap, we introduce‌ in 21 a Synergistic‌​‌ High-order Interaction Paradigm (SHIP),​​ designed to systematically investigate​​​‌ spatial fine-grained and global‌ statistics collaborations between the‌​‌ multi-modal images across two​​ fundamental dimensions: 1) Spatial​​​‌ dimension: we construct spatial‌ fine-grained interactions through element-wise‌​‌ multiplication, mathematically equivalent to​​​‌ global interactions, and then​ foster high-order formats by​‌ iteratively aggregating and evolving​​ complementary information, enhancing both​​​‌ efficiency and flexibility. 2)​ Channel dimension: expanding on​‌ channel interactions with first-order​​ statistics (mean), we devise​​​‌ high-order channel interactions to​ facilitate the discernment of​‌ inter-dependencies between source images​​ based on global statistics.​​​‌ We further introduce an​ enhanced version of the​‌ SHIP model, called SHIP++​​ that enhances the cross-modality​​​‌ information interaction representation by​ the cross-order attention evolving​‌ mechanism, cross-order information integration,​​ and residual information memorizing​​​‌ mechanism. 
Harnessing high-order interactions significantly enhances our model’s ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks in two significant multi-modal image fusion tasks: pan-sharpening, and infrared and visible image fusion.

Fully-Connected Transformer​​ for Multi-Source Image Fusion​​​‌

Participants: Jocelyn Chanussot.​

Multi-source image fusion combines the information coming from multiple images into a single output, thus improving imaging quality. This topic has aroused great interest in the community. Integrating information from different sources remains a major challenge, although existing self-attention-based transformer methods can capture spatial and channel similarities. In this paper 19, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, of which existing self-attention mechanisms are basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation and embed it into the FCSA framework as a non-local prior within an optimization problem. Several fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.

GeoFlowNet-SAR:​​​‌ Earthquake Displacement Estimation from​ Synthetic Aperture Radar Images​‌

Participants: Jocelyn Chanussot.​​

Displacement estimation using remote sensing images is an effective approach for assessing surface displacement caused by natural disasters like earthquakes and landslides. By employing pixel correlation algorithms, high-precision displacement maps can be generated from images taken before and after surface movement. However, traditional methods often rely on spatial regularization or frequency masking to reduce high-frequency noise, which can smooth spatial details and result in biased displacement estimates, especially near sharp discontinuities typical of earthquake surface ruptures. Moreover, subpixel displacement estimation using synthetic aperture radar (SAR) images remains a challenge compared to optical images, due to the strong impact of speckle noise. This article 18 presents GeoFlowNet-SAR, an innovative subpixel displacement estimation method leveraging SAR images. SAR offers advantages thanks to all-weather observation and high penetration, making it suitable for conditions typically challenging for optical systems in the visible light spectrum. This study uses Sentinel-1 SAR single look complex (SLC) images with dual-polarization (VV and VH modes) and interferometric wide (IW) swath mode to balance coverage and resolution. By training on simulated displacement datasets with realistic sharp discontinuities, GeoFlowNet-SAR directly predicts surface displacement fields, providing highly efficient, robust, and precise results while overcoming some limitations of traditional methods. The effectiveness of the proposed methodological contribution is first quantitatively demonstrated using synthetic simulated earthquake datasets, including comparisons with state-of-the-art correlation methods.
The method‌ is further validated using‌​‌ two real remote sensing​​ images from the 2019​​​‌ Ridgecrest earthquake and from‌ the 2023 Turkey–Syria earthquake.‌​‌ The observed results from​​ these real datasets confirm​​​‌ the effectiveness of GeoFlowNet-SAR‌ in practical applications.

Kolmogorov–Arnold‌​‌ Network for Hyperspectral Change​​ Detection

Participants: Jocelyn Chanussot​​​‌.

Hyperspectral change detection (HCD) techniques for monitoring Earth’s surface processes have advanced markedly in recent years. Seasonal variations and associated spectral signatures as well as nonlinear noise patterns emanating from sensors and atmospheric sources pose fundamental challenges in HCD. Advanced deep learning models, such as those that leverage convolutional neural networks (3D-Siamese) or transformers (MLP-Mixer), are increasingly employed to address these challenges. However, they often need substantial training data and computational resources. Here, we show that the Kolmogorov–Arnold network (KAN) can enhance HCD capabilities without the excessive training demand of deep networks. The Kolmogorov–Arnold theorem provides the theoretical foundation for our approach, which is particularly well-suited for hyperspectral data analysis by providing a rigorous basis for handling high-dimensional spectral signatures through dimensional reduction and feature extraction. Our architectural design employs this theoretical framework by incorporating specialized neural network layers that mirror the theorem’s compositional structure, thereby facilitating efficient processing of spectral bands. By replacing the linear weighting scheme with learnable nonlinear functions, the KAN provides a unique capability to capture intricate patterns and irregularities in high-dimensional data.
Here,‌ we compare five KAN-based‌​‌ architectures and deep learning​​ models such as the​​​‌ MLP-Mixer, 3D-Siamese, dual-branch Siamese‌ spatial–spectral Transformer attention network‌​‌ (DBS3TAN), and the Swin​​ Transformer for HCD and​​​‌ show that the Chebyshev-KAN‌ model, with an average‌​‌ overall accuracy of 97.35%​​ over four real-world benchmark​​​‌ cases, outperforms other models‌ while having a marked‌​‌ lower complexity than the​​ deep learning models. We​​​‌ also show that the‌ choice of fit nonlinear‌​‌ function and model structure​​ is more important than​​​‌ the number of parameters‌ in KAN-based models 15‌​‌.
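As an illustration of replacing linear weights with learnable nonlinear functions, the following minimal sketch implements the forward pass of a Chebyshev-based KAN layer in NumPy. It is not the architecture of 15: the tanh squashing, the polynomial degree, the layer sizes and all names (`chebyshev_basis`, `chebyshev_kan_layer`, `coeffs`) are assumptions made for the example.

```python
import numpy as np

def chebyshev_basis(x, degree):
    """Evaluate T_0..T_degree at x in [-1, 1]; output shape is x.shape + (degree + 1,)."""
    basis = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        # Recurrence: T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)
        basis.append(2.0 * x * basis[-1] - basis[-2])
    return np.stack(basis[: degree + 1], axis=-1)

def chebyshev_kan_layer(x, coeffs):
    """x: (batch, n_in); coeffs: (n_in, n_out, degree + 1) learnable coefficients."""
    degree = coeffs.shape[-1] - 1
    squashed = np.tanh(x)  # map inputs into the Chebyshev domain [-1, 1]
    basis = chebyshev_basis(squashed, degree)  # (batch, n_in, degree + 1)
    # Each input-output edge applies its own polynomial; outputs sum over inputs.
    return np.einsum("bik,iok->bo", basis, coeffs)

rng = np.random.default_rng(0)
spectra = rng.standard_normal((4, 8))           # 4 hypothetical pixels, 8 spectral bands
coeffs = 0.1 * rng.standard_normal((8, 2, 4))   # degree-3 polynomials, 2 outputs
out = chebyshev_kan_layer(spectra, coeffs)
print(out.shape)
```

Here the parameter count is `n_in * n_out * (degree + 1)`, so the expressiveness of each edge is controlled by the basis and its degree rather than by network depth.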

ECSPLAIN: Explainability Constrained-claSsifier​​ for Pairing the detection​​​‌ and the Localization of‌ moving Areas from SAR‌​‌ INterferograms

Participants: Jocelyn Chanussot​​.

Detecting slope instabilities on synthetic aperture radar (SAR) interferograms using deep learning approaches presents several challenges. This detection task suffers from the lack of transparency of deep networks, the complexity of the input data (i.e., complex values, sensitivity to distortions, and presence of counterfactuals), and the complexity of the target phenomena (i.e., the variable velocities and the complex underground processes). In this article 5, we propose a new framework called explainability-constrained classifier for pairing the detection and the localization of moving areas on interferograms (ECSPLAIN), which generates decision, localization, and segmentation maps from a single explainable classifier network. It consists of training a classifier to detect whether or not an instability is located in the patch, and to explain its decision with a class activation map (CAM) that matches the actual location of the instability. By using a single classifier network, the framework can thus pair the detection and the localization of moving areas. Four CAMs are investigated for the training of the ECSPLAIN framework. Experiments on the ISSLIDE dataset show that our proposal achieves better explainability than standard a posteriori CAMs, with more than 0.20 points of improvement in terms of Dice and intersection over union (IoU) scores. It also achieves competitive performance with segmentation-only networks, with only 0.04 points of difference in terms of Dice and IoU scores. Thus, the proposed method is competitive with the most efficient methods while being lighter, faster, and delivering a decision based on a human-like reasoning process. Finally, the ECSPLAIN framework is applied to enrich the ISSLIDE dataset, discovering more than 470 manually validated slope instabilities over the Alps.
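For readers unfamiliar with the two overlap scores quoted above, the sketch below computes Dice and IoU between a binary predicted mask (for instance a thresholded CAM) and a ground-truth mask. The function name `dice_and_iou`, the toy masks and the convention of returning 1.0 when both masks are empty are assumptions for the example, not the evaluation protocol of the article.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2.0 * intersection / total if total else 1.0
    iou = intersection / union if union else 1.0
    return float(dice), float(iou)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

The two scores are monotonically related (Dice = 2 IoU / (1 + IoU)), which is why they are usually reported together and move in the same direction.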

8.4 Other​​​‌ pluri-disciplinary projects

Challenges in​ Non-Polymeric Crystal Structure Prediction:​‌ Why a Geometric, Permutation-Invariant​​ Loss is Needed

Participants:​​​‌ Emmanuel Jehanno, Romain​ Menegaux, Julien Mairal​‌, Sergei Grudinin.​​

Crystalline structure prediction is​​​‌ an essential prerequisite for​ designing materials with targeted​‌ properties. Yet, it is​​ still an open challenge​​​‌ in materials design and​ drug discovery. Despite recent​‌ advances in computational materials​​ science, accurately predicting three-dimensional​​​‌ non-polymeric crystal structures remains​ elusive. In this work​‌ 47, we focus​​ on the molecular assembly​​​‌ problem, where a set​ S of identical rigid​‌ molecules is packed to​​ form a crystalline structure.​​​‌ Such a simplified formulation​ provides a useful approximation​‌ to the actual problem.​​ However, while recent state-of-the-art​​​‌ methods have increasingly adopted​ sophisticated techniques, the underlying​‌ learning objective remains ill-posed.​​ We propose a better​​​‌ formulation that introduces a​ loss function, illustrated in​‌ Figure 34, capturing​​ key geometric molecular properties​​​‌ while ensuring permutation invariance​ over S. Remarkably, we​‌ demonstrate that within this​​ framework, a simple regression​​​‌ model already outperforms prior​ approaches, including flow matching​‌ techniques, on the COD-Cluster17​​ benchmark, a curated non-polymeric​​​‌ subset of the Crystallography​ Open Database (COD).
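The permutation-invariance requirement can be illustrated with a small sketch: since the molecules in S are identical, a prediction should not be penalized for listing them in a different order than the reference. The code below takes the minimum of a squared-distance loss over all assignments of predicted to target molecules. It is only an illustration of the invariance property, under strong simplifying assumptions: molecules are reduced to their centroids, the geometric terms of the loss in 47 are not reproduced, and the brute-force search and the name `permutation_invariant_loss` are choices made for the example.

```python
import itertools
import numpy as np

def permutation_invariant_loss(pred, target):
    """pred, target: (n_molecules, 3) arrays of molecule centroids."""
    # cost[i, j] = squared distance between predicted molecule i and target molecule j
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    n = len(pred)
    rows = np.arange(n)
    # Brute-force minimum over all assignments; only practical for small n.
    return min(
        float(cost[rows, list(perm)].mean())
        for perm in itertools.permutations(range(n))
    )

target = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
# Same structure, molecules listed in another order, slightly perturbed.
pred = target[[2, 0, 1]] + 0.1
loss_invariant = permutation_invariant_loss(pred, target)
loss_naive = float(((pred - target) ** 2).sum(axis=-1).mean())
print(loss_invariant, loss_naive)
```

An index-wise loss such as `loss_naive` heavily penalizes this prediction although it describes almost the same packing; for larger sets the minimum over assignments would typically be computed with a linear assignment solver rather than by enumeration.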

Figure 34: Illustration​​ of the geometric loss.​​​‌

9 Bilateral contracts and​ grants with industry

9.1​‌ Bilateral contracts with industry​​

Participants: Julien Mairal,​​​‌ Karteek Alahari, Pierre​ Gaillard.

In 2025,​‌ we had:

  • four CIFRE​​ PhD students with Meta:​​​‌ Timothée Darcet (co-advised by​ J. Mairal), who defended​‌ in June 2025, Eyal​​ Benaroche, who started in​​​‌ December 2025, Tariq Berrada​ Ifriqi (co-advised by K.​‌ Alahari), and Francois Porcher​​ (co-advised by K. Alahari),​​​‌ who started in April​ 2025.
  • one CIFRE PhD student with Naver Labs Europe: Juliette Marrie (co-advised by J. Mairal and M. Arbel), who defended in June 2025.
  • one CIFRE PhD student with EDF R&D: Bianca Marin Moreno (co-advised by P. Gaillard), who defended in October 2025.
  • one CIFRE​​ PhD student with Criteo:​​​‌ Julien Zhou (co-advised by‌ P. Gaillard).
  • one CIFRE‌​‌ PhD student with Ekimetrics:​​ Yedidia Agnimo (co-advised by​​​‌ K. Alahari), who started‌ in July 2025.
  • one‌​‌ CIFRE PhD student with​​ Enhance Lab: Vincent Herfeld​​​‌ (co-advised by J. Mairal).‌
  • a collaboration led by‌​‌ K. Alahari with Toyota​​ Motor Europe.

10 Partnerships​​​‌ and cooperations

10.1 International‌ initiatives

10.1.1 Participation in‌​‌ other International Programs

Project​​ EIFFEL

Participants: Karteek Alahari​​​‌, Pia Bideau.‌

  • Title:
    Efficient Distillation of‌​‌ Foundation Models for Computer​​ Vision
  • Duration:
    2025 -​​​‌ 2028
  • Summary:
    This collaborative project with South Korea is supported by the Institute of Information Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project). Its focus is on efficient foundation models. Foundation models, which are trained on massive amounts of curated data using huge resources, constitute one of the most recent advancements in machine learning for computer vision and other domains. They are typically produced by large corporations or as part of industrial/academic collaborations, which raises fundamental challenges for academia. One of the scientific objectives is to widen the reach of these models by proposing computationally efficient counterparts as well as variants that jointly leverage multiple modalities, e.g., text, image, video, and audio. In particular, we are interested in developing new models under challenging but realistic scenarios that occur in many industrial and scientific applications, such as limited data, data with a temporally evolving distribution, or low computational resources.

10.2 European initiatives‌

10.2.1 Horizon Europe

APHELEIA‌​‌

APHELEIA project on cordis.europa.eu​​

  • Title:
    Reconciling Classical and​​​‌ Modern (Deep) Machine Learning‌ for Real-World Applications
  • Duration:‌​‌
    From September 1, 2023​​ to August 31, 2028​​​‌
  • Partners:
    • INSTITUT NATIONAL DE‌ RECHERCHE EN INFORMATIQUE ET‌​‌ AUTOMATIQUE (INRIA), France
  • Inria​​ contact:
    Mairal Julien
  • Summary:​​​‌

    Despite the undeniable success of machine learning in addressing a wide variety of technological and scientific challenges, the current trend of training predictive models with an ever-growing number of parameters from an ever-growing amount of data is not sustainable. These huge models, often engineered by large corporations benefiting from huge computational resources, typically require learning a billion or more parameters. They have proven to be very effective in solving prediction tasks in computer vision, natural language processing, and computational biology, for example, but they mostly remain black boxes that are hard to interpret, computationally demanding, and not robust to small data perturbations.

    With a strong emphasis on visual modeling, the grand challenge of APHELEIA is to develop a new generation of machine learning models that are more robust, interpretable, and efficient, and do not require massive amounts of data to produce accurate predictions. To achieve this objective, we will foster new interactions between classical signal processing, statistics, optimization, and modern deep learning. Our goal is to reduce the need for massive data by enabling scientists and engineers to design trainable machine learning models that directly encode a priori knowledge of the task semantics and data formation process, while automatically preferring simple and stable solutions over complex ones. These models will be built on solid theoretical foundations with convergence and robustness guarantees, which are important to make real-life trustworthy predictions in the wild. We will implement these ideas in an open-source software toolbox readily applicable to visual recognition and inverse imaging problems, which will also handle other modalities. This will stimulate interdisciplinary collaborations, with the potential to be a game changer in the way scientists and engineers design machine learning problems.

10.2.2 Other European programs/initiatives

J. Chanussot is involved in a project funded by the European Space Agency (ESA), "ROSE-L in Harmony: EO Data Integration for Global Land Cover and Vegetation Mapping", led by the Canadian company C-Core (2025-2028).

10.3​ National initiatives

10.3.1 ANR​‌ Project BONSAI

Participants: Michael​​ Arbel.

  • Project BONSAI is a multi-disciplinary project aiming at integrating knowledge produced by experts, in the form of simulators, into current machine learning frameworks through bilevel optimization for accurate and efficient inference. We address three challenges. The first one is to develop a deep learning-based approach to simulation-based inference that can adapt to data using bilevel optimization. A second challenge is to deploy the methods on real-world problems, which have their own specificities. A third challenge is to develop bilevel optimization methods that can handle the non-convexity and over-parameterization arising from the use of deep learning. The principal investigator is Michael Arbel, and the project involves participants from Toulouse School of Economics, the TIMC team at UGA and other INRIA teams (Statify). This project started in April 2024.

10.3.2 MIAI chair:​​ Learning Visual Representations from​​​‌ Interaction for Robot Manipulation​ Tasks

Participants: Pia Bideau​‌, Karteek Alahari.​​

  • How to grasp an object has been studied in computer vision and robotics, and several approaches to this problem exist: either contact points that lead to a stable hand-object configuration are determined from a given 3D shape of the object, or stable hand-object configurations are reconstructed by jointly modelling hand pose and object pose. In both cases many solutions are possible, although a majority might not be the natural approach that humans would choose, mainly because the intention behind the grasp is omitted. This project aims at learning visual representations from interaction that encode activity information. Encoding such contextual information appears to be relevant not only for synthesising feasible grasps; it is also likely to enhance generalisation by facilitating adaptation across different objects within the same activity: grasping a cup to pour something into a container shares a similar motion pattern with grasping a bottle to do the same. Inspired by the effectiveness of human grasping, we aim at finding similarly adaptable representations that are capable of guiding complex manipulation skills. To this end we will combine ideas from classical probabilistic modeling of distributions over possible motion trajectories with latent action representations from a conditional variational autoencoder (CVAE). These two directions have complementary strengths and thus offer a promising way to modulate the degree of action abstraction at test time, enabling both coarse and fine-grained control for real-world robot manipulation tasks. The chair is taking place in collaboration with Karteek Alahari, Xavier Alameda-Pineda, and Pierre-Brice Wieber. We have recruited one PhD student and have an intern starting in February 2025.

10.3.3 MIAI Cluster​​​‌ chair: MOnitoring natural Hazards‌ using AI and Remote‌​‌ sensing (MOHAIR)

Participants: Jocelyn​​ Chanussot.

  • J. Chanussot​​​‌ is the co-chair, with‌ Sophie Giffard-Roisin (IRD junior‌​‌ researcher, Laboratoire IsTerre) and​​ Yajing Yang (Associate Professor,​​​‌ LISTIC Univ. Savoie Mont-Blanc).‌ This project started in‌​‌ September 2025. It gathers​​ members from 7 different​​​‌ teams of 6 laboratories‌ in Grenoble, Annecy and‌​‌ Clermont-Ferrand.

    Satellite based remote​​ sensing, using a variety​​​‌ of sensing modalities (optical,‌ radar, hyperspectral, lidar) offers‌​‌ a unique source of​​ information to monitor the​​​‌ environment, with fine spatial‌ resolution, wide coverage and‌​‌ frequent revisit. This enables​​ addressing the challenge of​​​‌ natural hazard monitoring and‌ forecasting, which has a‌​‌ significant societal impact. To​​ fully harness the potential​​​‌ of remote sensing data,‌ advanced algorithms in machine‌​‌ learning, deep learning, or​​ more broadly artificial intelligence,​​​‌ must be developed. Gathering‌ an interdisciplinary team of‌​‌ experts, from data science,​​ environmental and Earth sciences,​​​‌ as well as social‌ sciences, this chair will‌​‌ focus on three important​​ topics: forest monitoring, Earth​​​‌ deformation estimation and volcanic‌ inverse modeling. From a‌​‌ methodological point of view,​​ research will be conducted​​​‌ on the analysis of‌ multimodal time series, multimodal‌​‌ deep and graph learning​​ and foundation models.

10.3.4​​​‌ MIAI chair: Fundamentals of‌ Reinforcement Learning

Participants: Pierre‌​‌ Gaillard.

  • P. Gaillard is the co-chair, with Bruno Gaujal (LIG, UGA), of this MIAI chair, which focuses on developing advanced methodologies for Reinforcement Learning (RL). The project aims to develop new RL algorithms with strong theoretical foundations and practical effectiveness by exploiting the problem's inherent structure. The focus areas include online control of queueing networks, weakly coupled stochastic dynamic systems (sometimes associated with bandits) and parametric learning for adaptive policies. These three approaches to structured learning will be used for innovative applications in energy, cloud computing, and resource allocation.

10.3.5 Deep‌ Red

Participants: Jocelyn Chanussot‌​‌.

  • J. Chanussot is​​ the chair of the​​​‌ Deep Red project from‌ the Foundation Grenoble INP‌​‌ under the patronage of​​ Lynred company (2022-2026). The​​​‌ project aims at popularizing‌ the technology of infrared‌​‌ imaging for new usages.​​

10.3.6 PEPR project Numpex​​​‌

Participants: Hadrien Hendrikx.‌

  • The 'Numpex' programme's objectives‌​‌ are to design and​​ develop the software building​​​‌ blocks required to equip‌ future 'exascale machines' and‌​‌ to prepare the major​​ application domains that aim​​​‌ to fully exploit the‌ capabilities of such machines‌​‌ for scientific research and​​​‌ industry alike. This project​ is part of France's​‌ response to the next​​ EuroHPC call for expressions​​​‌ of interest (Projet Exascale​ France) in hosting one​‌ of the two major​​ exascale machines planned in​​​‌ Europe for 2024. In​ this way 'Numpex' will​‌ contribute to the creation​​ of a set of​​​‌ tools, software, applications and​ training which will enable​‌ France to remain one​​ of the leaders in​​​‌ the field of international​ competition through its national​‌ Exascale ecosystem that is​​ in step with European​​​‌ strategy.

10.3.7 PEPR project​ Origins

Participants: Julien Mairal​‌.

  • Thoth is involved​​ in the axis “Direct​​​‌ imaging and exoplanet characterization”​ of the PEPR Origins.​‌ This is an on-going​​ collaboration with astronomers from​​​‌ Observatoire de Paris and​ Lyon and with the​‌ Willow team.

11 Dissemination​​

Participants: Julien Mairal,​​​‌ Karteek Alahari, Jocelyn​ Chanussot, Hadrien Hendrikx​‌, Michael Arbel,​​ Pierre Gaillard, Pia​​​‌ Bideau, Scott Pesme​.

11.1 Promoting scientific​‌ activities

11.1.1 Scientific events:​​ organisation

Member of the​​​‌ organizing committees
  • M. Arbel,​ P. Gaillard, H. Hendrikx,​‌ J. Mairal, G. Meanti,​​ S. Pesme and N.​​​‌ Gillot co-organized the PAISS​ summer school in Grenoble,​‌ which attracted about 200​​ students.
  • P. Gaillard co-organized with EDF R&D a workshop on Meta-models that attracted around 50 participants.
  • J. Chanussot was the general co-chair (with Prof. Xiuping Jia, University of New South Wales, and Prof. Jeffrey Walker, Monash University) of the IEEE Geoscience and Remote Sensing Symposium (IGARSS), which attracted 3200 participants in Brisbane, Australia, August 3-8, 2025.
  • J. Chanussot was the​‌ co-chair of GeoCV (First​​ Workshop on Computer Vision​​​‌ for Geospatial Image Analysis)​ at the IEEE/CVF Winter​‌ Conference on Applications of​​ Computer Vision (WACV workshop)​​​‌ Tucson, AZ, March, 2025.​
  • J. Chanussot was the​‌ co-chair of MORSE (Workshop​​ on Foundation and Large​​​‌ Vision Models in Remote​ Sensing) at the IEEE/CVF​‌ Conference on Computer Vision​​ and Pattern Recognition (CVPR​​​‌ Workshop) Nashville, TN, June​ 2025.

11.1.2 Scientific events:​‌ selection

Chair of conference​​ program committees
  • K. Alahari​​​‌ will be a program​ co-chair for ECCV 2028​‌ (Bucharest, Romania).
  • J. Chanussot will be the Technical Program Committee chair for the IEEE Geoscience and Remote Sensing Symposium (IGARSS) to be held in Reykjavik, Iceland, in 2027.
Member of the conference​‌ program committees
  • K. Alahari​​ was an area chair​​​‌ for CVPR 2025, ICCV​ 2025, NeurIPS 2025, and​‌ will be area chair​​ for upcoming ICML 2026.​​​‌
  • P. Gaillard was an​ area chair for ICML​‌ 2025, and will be​​ area chair for upcoming​​​‌ ICML 2026.
  • M. Arbel​ was an area chair​‌ for NeurIPS 2025, and​​ will be area chair​​​‌ for the upcoming ICML​ 2026.
  • J. Mairal will​‌ be an area chair​​ for the upcoming ICML​​​‌ 2026.
Reviewer
  • J. Mairal​ was reviewer for ICCV​‌ 2025, ICLR 2026 and​​ NeurIPS 2025 (where he​​​‌ received a top reviewer​ award).
  • K. Alahari was​‌ reviewer for CVPR 2026,​​ BMVC 2025.
  • P. Gaillard​​​‌ was reviewer for NeurIPS​ 2025.
  • H. Hendrikx was reviewer for ICML 2025.
  • M. Arbel was reviewer​​ for AISTATS 2025 (where​​​‌ he received a top‌ reviewer award), ICCV 2025,‌​‌ ICLR 2026.

11.1.3 Journal​​

Member of the editorial​​​‌ boards
  • J. Mairal. Editor for Journal of Machine Learning Research (JMLR).
  • K. Alahari. Associate editor of International Journal of Computer Vision (IJCV).
  • J. Chanussot. Associate editor of the IEEE Transactions on Geoscience and Remote Sensing.
Reviewer - reviewing‌​‌ activities
  • P. Gaillard was​​ reviewer for JMLR.
  • H.​​​‌ Hendrikx was reviewer for‌ JMLR and SIOPT
  • M.‌​‌ Arbel was reviewer for​​ JMLR.

11.1.4 Invited talks​​​‌

  • J. Mairal was an‌ invited speaker at the‌​‌ BASP workshop, Villars sur​​ Ollon. Feb. 2025.
  • J.​​​‌ Mairal was an invited‌ speaker at the OSKI‌​‌ workshop, Aussois. March 2025.​​
  • J. Mairal was an​​​‌ invited speaker at the‌ GDR-IASIS workshop, Lyon. March‌​‌ 2025.
  • J. Mairal was​​ an invited speaker at​​​‌ Academie des Sciences (inter-section‌ meeting). June 2025.
  • J.‌​‌ Mairal gave an invited​​ seminar at the ELLIS​​​‌ Stuttgart unit. June 2025.‌
  • J. Mairal was an‌​‌ invited speaker at the​​ Non-convex optimization: landscapes, dynamics​​​‌ and learning workshop, EPFL,‌ Aug. 2025.
  • J. Mairal‌​‌ was an invited speaker​​ at the GDR-IASIS workshop,​​​‌ Paris. Sept. 2025.
  • K.‌ Alahari was a keynote‌​‌ speaker at the Inria-Waterloo​​ workshop at Univ. Waterloo,​​​‌ Canada. May 2025
  • K.‌ Alahari was an invited‌​‌ speaker at Journées de​​ statistique de la SFdS,​​​‌ Marseille. June 2025.
  • K.‌ Alahari was an invited‌​‌ speaker at the Global​​ AI Frontiers Symposium, Seoul,​​​‌ South Korea. Oct. 2025.‌
  • K. Alahari was an‌​‌ invited speaker at the​​ Open Science Days@UGA, Grenoble.​​​‌ Nov. 2025.
  • K. Alahari‌ was a keynote speaker‌​‌ at the Sfen workshop​​ on apport de l'IA​​​‌ a la science des‌ materiaux pour l'industrie nucleaire,‌​‌ Paris. Dec. 2025.
  • S.​​ Pesme was an invited​​​‌ speaker at Journée scientifique‌ du groupe SMAI-SIGMA, Dec.‌​‌ 2025, Paris.
  • S. Pesme​​ gave an invited seminar​​​‌ at Centrale Supelec. Nov.‌ 2025.
  • S. Pesme gave‌​‌ a talk at the​​ Oberwolfach Mini-Workshop on Probabilistic​​​‌ Perspectives in Neural Network-Based‌ Machine Learning, Oct. 2025,‌​‌ Oberwolfach, Germany
  • S. Pesme​​ gave an invited talk​​​‌ at the Workshop sur‌ les modèles génératifs :‌​‌ diffusion, flow matching et​​ leurs applications, Oct. 2025,​​​‌ Lyon.
  • S. Pesme gave‌ an invited seminar at‌​‌ Eindhoven University of Technology,​​ Netherlands, Oct 2025
  • S.​​​‌ Pesme gave an invited‌ talk at the Workshop‌​‌ on the Statistical Theory​​ of Neural Networks, May​​​‌ 2025, University of Twente,‌ Netherlands.
  • J. Marrie gave‌​‌ an invited seminar at​​ Ecole des Ponts, Marne​​​‌ la Vallée, March 2025.‌
  • T. Darcet gave an‌​‌ invited talk at the​​ BLISS summer school, TU​​​‌ Berlin. May 2025.
  • T.‌ Bodrito gave a talk‌​‌ at the COBREX seminar.​​ Feb. 2025.
  • T. Bodrito​​​‌ gave a talk at‌ the Journées de la‌​‌ Société Française d'Astronomie (SF2A).​​ July 2025.
  • P. Gaillard​​​‌ was an invited speaker‌ at a scientific seminar‌​‌ organized by the LabEx​​ EnergyAlps. May 2025.
  • H.​​​‌ Hendrikx gave an invited‌ seminar at Inria Montpellier,‌​‌ May 2025.
  • H. Hendrikx​​ gave a talk at​​​‌ project Redeem (PEPR IA)‌ annual meeting, October 2025.‌​‌
  • M. Arbel was an​​​‌ invited speaker at the​ RKHS Seminars, METU, February​‌ 2025.

11.1.5 Scientific expertise​​

  • J. Mairal was a member of the Helmholtz panel on scientific imaging.
  • J. Mairal was a​​ member of the Prairie​​​‌ panel for junior chairs.​
  • J. Mairal was a​‌ panel member for the​​ research council of Norway.​​​‌
  • K. Alahari was a​ member of the CRCN/ISFP​‌ 2025 recruitment committee at​​ Grenoble.
  • P. Gaillard was​​​‌ a reviewer for the​ JCJC call from ANR.​‌
  • H. Hendrikx was a​​ panel member for the​​​‌ TSIA call from ANR,​ subcommittee "IA & Environnements,​‌ écosystèmes, ressources biologiques"

11.1.6​​ Research administration

  • J. Mairal​​​‌ is a member of​ the scientific committee (COS)​‌ of Inria Grenoble's research​​ center, and also a​​​‌ member of the scientific​ committee of MIAI.
  • K.​‌ Alahari is the deputy​​ scientific director in charge​​​‌ of AI at Inria.​
  • K. Alahari is one​‌ of the scientific directors​​ of the PEPR IA​​​‌ national research programme.
  • K.​ Alahari is responsible for​‌ the Mathematics and Computer​​ Science specialist field at​​​‌ the MSTII doctoral school.​
  • K. Alahari is a​‌ member of commission prospection​​ postes at LJK.
  • H.​​​‌ Hendrikx is Chargé de​ mission Science Environnement Société​‌ (SEnS) for Inria Grenoble.​​
  • H. Hendrikx is the​​​‌ Inria Transformation Ecologique (TREC)​ representative at UGA.
  • H.​‌ Hendrikx is a leading​​ member of the Inria​​​‌ Grenoble socio-environmental roadmap.
  • J.​ Chanussot is a member​‌ of the Commission Recherche​​, University Grenoble Alpes.​​​‌

11.2 Teaching - Supervision​ - Juries - Educational​‌ and pedagogical outreach

11.2.1​​ Supervision

  • Théo Bodrito defended his PhD in June 2025. He was co-advised by Olivier Flasseur, Jean Ponce and Julien Mairal. See the manuscript 41.
  • Timothée Darcet defended his PhD in June 2025. He was co-advised by Piotr Bojanowski, Maxim Oquab, and Julien Mairal. See the manuscript 42.
  • Juliette Marrie defended her​​​‌ PhD in June 2025.​ She was co-advised by​‌ Michael Arbel, Diane Larlus​​ and Julien Mairal.
  • Bianca​​​‌ Marin Moreno defended her​ PhD in October 2025.​‌ She was co-advised by​​ P. Gaillard.
  • Zhiqi Kang​​​‌ defended his PhD in​ November 2025. He was​‌ advised by Karteek Alahari.​​
  • Anandaramane Candassamy defended his PhD in September 2025. He was co-advised by J. Chanussot.
  • Colin Prieur defended his PhD in November 2025. He was co-advised by J. Chanussot.

11.2.2​‌ Juries

  • J. Mairal was​​ reviewer for the PhD​​​‌ thesis of Samuel Gruffaz,​ Univ. Paris Saclay. 2025.​‌
  • J. Mairal was reviewer​​ for the HdR of​​​‌ Thomas Moreau, Univ. Paris​ Saclay. 2025.
  • J. Mairal​‌ was a member of​​ the PhD committee of​​​‌ Gaspard Dussert, Univ. Lyon​ 1. 2025.
  • J. Mairal was a member of the HdR committee of Maxime Sangnier, PSL Sorbonne Université. 2025.
  • K. Alahari​‌ was a member of​​ the PhD jury of​​​‌ Mohammmed-Yasser Benigmim, IP Paris.​ 2025.
  • K. Alahari was​‌ a reviewer for the​​ PhD thesis of Corentin​​​‌ Sautier, Ecole des Ponts​ ParisTech. 2025.
  • K. Alahari​‌ was a reviewer for​​ the PhD thesis of​​​‌ Tanay Agrawal, Université Côte​ d'Azur. 2025.
  • K. Alahari​‌ was the president of​​ the PhD jury of​​ Guillaume Déau, Univ. Poitiers.​​​‌ 2025.
  • P. Gaillard was‌ reviewer for the PhD‌​‌ thesis of Lukas Zierahn,​​ Politecnico di Torino, Italy.​​​‌ 2025.
  • P. Gaillard was‌ a member of the‌​‌ PhD committee of Antoine​​ Picard, Univ. Lille. 2025.​​​‌
  • J. Chanussot was a‌ reviewer for the PhD‌​‌ of Liang Zhao, University​​ of South Australia (Australia)​​​‌ 2025.
  • J. Chanussot was‌ a reviewer for the‌​‌ PhD of Kimmo Riihiaho,​​ University of Jyväskylä (Finland)​​​‌ 2025.
  • J. Chanussot was‌ a reviewer for the‌​‌ PhD of Dan Pineau,​​ Université Paris-Saclay, 2025.
  • J.​​​‌ Chanussot was a reviewer‌ for the PhD of‌​‌ Yi Wang, TU Munich​​ (Germany) 2025.
  • J. Chanussot​​​‌ was a reviewer for‌ the PhD of Sai‌​‌ Reddy B., GITAM -​​ Deemed to be University​​​‌ (India) 2025.
  • J. Chanussot‌ was a reviewer for‌​‌ the PhD of Triem​​ Pham, Université Paris-Saclay, 2025.​​​‌
  • J. Chanussot was the‌ president of the PhD‌​‌ jury of Astrid Tazzioli,​​ Université PSL Paris, 2025.​​​‌
  • J. Chanussot was a‌ reviewer for the PhD‌​‌ of Ritu Yadav, KTH​​ (Sweden), 2025
  • J. Chanussot​​​‌ was a reviewer and‌ the president of the‌​‌ committee for the PhD​​ of Vadim Becquet, Université​​​‌ Paris PSL - Mines‌ de Paris, 2025
  • J.‌​‌ Chanussot was a reviewer​​ for the HdR of​​​‌ Minh-Tan PHAM, Université de‌ Bretagne Sud, 2025
  • M.‌​‌ Arbel was a member​​ of the PhD committee​​​‌ of Alessandro Pasqui, PSL‌ Université de Paris, 2025.‌​‌

11.2.3 Educational and pedagogical​​ outreach

  • Master: M. Arbel​​​‌ and J. Mairal, Kernel‌ methods for statistical learning,‌​‌ 36h eqTD, M2, ENS​​ Paris-Saclay/PSL, France.
  • Master: M.​​​‌ Arbel, J. Mairal and‌ S. Pesme, From Basic‌​‌ Machine Learning models to​​ Advanced Kernel Learning, 54h​​​‌ eqTD, M2, UGA, Grenoble.‌
  • Master: P. Gaillard, Sequential‌​‌ Learning, 12h eqTD, M2,​​ MVA, ENS Paris-Saclay, France.​​​‌
  • Master: H. Hendrikx, Numerical‌ Optimization, 40h eqTD, M1,‌​‌ UGA, Grenoble
  • Master: J.​​ Chanussot, Hyperspectral imaging, 25h​​​‌ eqTD, M2, Grenoble INP‌

11.3 Popularization

11.3.1 Productions‌​‌ (articles, videos, podcasts, serious​​ games, ...)

  • K. Alahari​​​‌ participated in a podcast‌ interview for Interstices 54‌​‌.

11.3.2 Participation in​​ Live events

  • K. Alahari co-hosted the “Café IA” event at Inria Grenoble.
  • S. Pesme participated in the school workshops held on October 9 and 10 as part of the "Éclats de sciences" programme on the Université Grenoble Alpes campus in Saint-Martin-d'Hères.
  • S. Pesme participated in two “Café IA” events, one at Inria (September 30, 2025) and another with Digital League (December 2, 2025), and gave a talk at the Math Olympiad Ceremony (June 4, 2025, Université Grenoble-Alpes).
  • S. Pesme participated in classroom sessions for the "Semaine des Maths" (March 10–19, 2025, schools in the Grenoble academy).

11.3.3 Other relevant science outreach activities

  • S.​​ Pesme was interviewed for​​​‌ the fête de la‌ science 2025.
  • J.‌​‌ Chanussot is a member​​ of the scientific advisory​​​‌ board of the établissement‌ public de coopération culturelle‌​‌ « Territoire de Sciences​​ » with its two​​​‌ components: Cosmocité Museum and‌ Grenoble Casemate.
  • J. Chanussot‌​‌ organized a half-day event​​ about thermal imaging for​​​‌ the 10th grade students‌ doing their internship at‌​‌ INRIA.

12 Scientific production​​​‌

12.1 Publications of the​ year

International journals

International peer-reviewed​​​‌ conferences

Doctoral dissertations and​​ habilitation theses

Reports​ & preprints

Scientific​​ popularization

12.2 Cited publications​​​‌

  • 55 C. Fernandez, C. S. Chen, P. Gaillard and A. Silva. Experimental Comparison of Ensemble Methods and Time-to-Event Analysis Models Through Integrated Brier Score and Concordance Index. 2024, working paper or preprint.
  1. For example, at the Dutch national broadcast archive Netherlands Institute of Sound and Vision, with whom we collaborated in the EU FP7 project AXES, typically one or two sentences are used in the metadata to describe a one-hour-long TV program.