WILLOW

2025 Activity Report, Project-Team WILLOW

RNSR: 200718311C

Research center Inria‌ Paris Centre
In partnership with: Ecole normale supérieure de Paris, CNRS
Team name: Embodied computer vision‌
In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure

Creation of the Project-Team: 2021 May 01‌

Keywords

Computer Science and Digital Science

A3.1.1.‌ Modeling, representation
A3.4. Machine‌ learning and statistics
A5.3.‌‌ Image processing and analysis
A5.10. Robotics
A9. Artificial‌ intelligence
A9.1. Knowledge
A9.2.‌ Machine learning
A9.3. Signal‌‌ processing
A9.5. Robotics and AI
A9.7. AI algorithmics‌
A9.12. Computer vision

Other‌ Research Topics and Application‌‌ Domains

B9.5.1. Computer science
B9.5.6. Data science

1‌ Team members, visitors, external‌ collaborators

Research Scientists

Justin‌‌ Carpentier [Team leader, INRIA, Researcher‌]
Stephane Caron [‌INRIA, Associate Professor‌‌ Detachement, HDR]
Shizhe Chen [INRIA‌, Researcher]
Yann‌ Dubois De Mont-Marin [‌‌INRIA, Starting Research Position, from May‌ 2025]
Frederike Dumbgen‌ [INRIA, Starting‌‌ Research Position, until Apr 2025]
Wilson‌ Jallet [INRIA,‌ Starting Research Position,‌‌ from May 2025]
Louis Montaut [INRIA‌, Starting Research Position‌, from May 2025‌‌]
Jean Ponce [ENS PARIS, Senior‌ Researcher, HDR]‌
Cordelia Schmid [INRIA‌‌, Senior Researcher, HDR]

Post-Doctoral Fellows‌

Ewen Dantec [ENS‌ Paris, Post-Doctoral Fellow‌‌, until Nov 2025]
Yann Dubois De‌ Mont-Marin [INRIA,‌ Post-Doctoral Fellow, until‌‌ Apr 2025]
Frederike Dumbgen [INRIA,‌ Post-Doctoral Fellow, from‌ May 2025]
Wilson‌‌ Jallet [INRIA, Post-Doctoral Fellow, until‌ Apr 2025]
Quentin‌ Le Lidec [INRIA‌‌, Post-Doctoral Fellow, until May 2025]‌
Etienne Menager [INRIA‌, Post-Doctoral Fellow]‌‌
Louis Montaut [INRIA, Post-Doctoral Fellow,‌ until Apr 2025]‌
Etienne Moullet [INRIA‌‌, Post-Doctoral Fellow, until Sep 2025]‌
Ajay Sathya [INRIA‌, Post-Doctoral Fellow]‌‌

PhD Students

Roland Andrews [INRIA]
Theo‌ Bodrito [INRIA,‌ until Apr 2025]‌‌
Thomas Chabal [INRIA, until Aug 2025‌]
Zerui Chen [‌INRIA, until Apr‌‌ 2025]
Ludovic De Matteis [UNIV TOULOUSE‌ III, until Apr‌ 2025]
Gabriel Fiastre‌‌ [INRIA]
Matthieu Futeral-Peter [INRIA,‌ until Jun 2025]‌
Ricardo Garcia Pinel [‌‌INRIA, until Jun 2025]
Francois Garderes‌ [LOUIS VUITTON,‌ CIFRE]
Umit Bora‌‌ Gokbakan [INRIA]
Zeeshan Khan [INRIA‌]
Theotime Le Hellard‌ [INRIA, from‌‌ Oct 2025]
Shiyao‌ Li [ENPC]
Javier Alejandro Lopetegui Gonzalez‌ [INRIA, from Nov 2025]
Imen‌ Mahdi [University of Freiburg]
Franki Nguimatsia‌ Tiofack [INRIA]
Paul Pacaud [INRIA‌]
Sara Pieri [INRIA]
François Porcher‌ [Meta, CIFRE]
Mathis Scheffler [INRIA‌, from Oct 2025]
Fabian Schramm [‌INRIA]
Romain Seailles [ENS Paris]‌
Federica Spinola [INRIA, from Oct 2025‌]
Basile Terver [FACEBOOK, CIFRE]‌
Valentin Tordjman–Levavasseur [INRIA, from Oct 2025‌]
Lucas Ventura [ENPC]
Elliot Vincent‌ [Ministère Transition, until Jun 2025]‌

Technical Staff

Etienne Arlaud [INRIA, Engineer‌]
Walid Bousselham [INRIA, Engineer,‌ from May 2025 until Jun 2025]
Riccardo‌ Cadei [INRIA, Engineer, from Mar‌ 2025 until Jul 2025]
Timothee Carecchio [‌INRIA, Engineer, from Sep 2025]‌
Aamr El Kazdadi [INRIA, Engineer,‌ from Oct 2025]
Pierre Fabre [INRIA‌, Engineer, from Jun 2025]
Lucas‌ Haubert [INRIA, Engineer]
Qikai Huang‌ [UNIV GEORGIA, Engineer, from Dec‌ 2025]
Peteris Kulits [INRIA, Engineer‌, from Jun 2025 until Nov 2025]‌
Louise Manson [INRIA, Engineer]
Jeanne‌ Matheron [INRIA, from Nov 2025]‌
Megane Millan [INRIA, Engineer, until‌ Oct 2025]
Louis Nel [INRIA,‌ Engineer, from Dec 2025]
Valentin Tordjman–Levavasseur‌ [INRIA, until Sep 2025]
Joris‌ Vaillant [INRIA, Engineer]

Interns and‌ Apprentices

Joaquin Austin Ferro [INRIA, Apprentice‌, from Sep 2025]
Timothee Carecchio [‌INRIA, Intern, from Feb 2025 until‌ Aug 2025]
Radu Cristian [INRIA,‌ Intern, from Mar 2025 until Aug 2025‌]
Theotime Le Hellard [INRIA, from‌ Sep 2025 until Sep 2025]
Theotime Le‌ Hellard [ENS Paris, Intern, from‌ Apr 2025 until Aug 2025]
Romain Li‌ [INRIA, Intern, from Jul 2025‌ until Sep 2025]
Armand Modjabi [INRIA‌, Intern, from Mar 2025 until Aug‌ 2025]
Louis Nel [INRIA, Intern‌, from Jun 2025 until Nov 2025]‌
Matthieu Rouet [ENS Paris, Intern,‌ from Jun 2025 until Aug 2025]
Mathis‌ Scheffler [INRIA, Intern, from Apr‌ 2025 until Sep 2025]

Administrative Assistant

Marina‌ Kovacic [INRIA]

Visiting Scientists

Qikai Huang‌ [UNIV GEORGIA, until Nov 2025]‌
Kim Jae Myung [UNIV TUBINGEN, until‌ May 2025]
Victor Klemm [INSTITUT ETH‌, from May 2025 until Jun 2025]‌
Michael Tarr [CMU, from Sep 2025‌ until Oct 2025]
Baohe Zhang [ University‌ of Freiburg, from Aug 2025 until Nov‌ 2025]

External Collaborators

Antoine Hoarau [Self-employed, from Oct 2025]
Theotime Le Hellard‌ [ENS de Paris‌‌, until Apr 2025]

2 Overall objectives‌

2.1 Statement

Building machines‌ that can automatically understand‌‌ complex visual inputs is one of the central‌ scientific challenges in artificial‌ intelligence. Truly successful visual‌‌ understanding technology will have a broad impact in‌ application domains as varied‌ as defense, entertainment, healthcare,‌‌ human-computer interaction, image retrieval and data mining, industrial‌ and personal robotics, manufacturing,‌ scientific image analysis, surveillance‌‌ and security, and transportation.

The problem is, however, very difficult due to the large variability of the visual world and the high complexity of the underlying physical phenomena. For example, people easily learn how to perform complex tasks such as changing a car tire or performing resuscitation by observing other people. This involves advanced visual perception and interaction capabilities including interpreting sequences of human actions, learning new visuomotor skills from only a few example demonstrations, grounding instructions in appropriate scene elements and actions, and applying the learned skills in new environments and situations. Currently, however, there is no artificial system with a similar level of cognitive visual competence. Our goal for the next 10 years is to develop models, methods and algorithms providing a sufficient level of visual intelligence to enable applications such as personal visual assistants or home robots that will, for example, prepare a meal in response to a chat request.

Despite the tremendous progress‌ in visual recognition in‌‌ the last decade, current visual recognition systems still‌ require large amounts of‌ carefully annotated training data,‌‌ often use black-box architectures that do not model‌ the 3D physical nature‌ of the visual world,‌‌ are typically limited to simple pattern recognition tasks‌ such as detecting and‌ recognizing objects from a‌‌ predefined vocabulary, and do not capture real-world semantics.‌ We plan to address‌ these limitations with an‌‌ ambitious research program that aims at developing models‌ of the entire visual‌ understanding process from image‌‌ acquisition to the high-level embodied interpretation of visual‌ scenes. We target learnable‌ models that require minimal‌‌ to no supervision, support complex reasoning about visual‌ data, and are grounded‌ in interactions with the‌‌ physical world. More concretely, we will address fundamental‌ scientific challenges along three‌ research axes: (i) visual‌‌ recognition in images and videos with an emphasis‌ on weakly supervised learning‌ and models grounded in‌‌ the physical 3D world; (ii) learning embodied visual‌ representations for robotic manipulation‌ and locomotion; and (iii)‌‌ image restoration and enhancement. These challenges will be‌ tackled by a team‌ of researchers with core‌‌ expertise in computer vision and robotics, who will‌ simultaneously advance both fields‌ towards convergence. The complementary‌‌ expertise in areas such as machine learning and‌ natural language understanding will‌ be gained through collaboration‌‌ with relevant research teams.

We believe that foundational‌ research should be grounded‌ in applications and we‌‌ plan to pursue applications with high scientific, societal,‌ and/or economic impact in‌ domains such as transportation;‌‌ augmented reality; education; advanced‌ manufacturing; and quantitative visual analysis in sciences, humanities‌ and healthcare.

3 Research program

3.1 Visual recognition‌ and reconstruction of images and videos

It is now possible to efficiently detect individual objects and people in cluttered images and videos. Current methods, however, rely on large-scale, manually annotated image collections, often use black-box architectures that do not model the 3D physical nature of the visual world, and are typically limited to simple pattern recognition tasks. In this part of the research program, we address these fundamental limitations. In particular, we address the following three key open challenges: (i) how to leverage available but weak annotations including text, audio and speech, (ii) how to enable automatic reasoning about visual data, and (iii) how to develop models grounded in the physical 3D world, including learnable models for 3D object and scene reconstruction. We also continue theoretical work aimed at understanding the geometric underpinnings of computer vision.

Our current efforts in this area are outlined in detail in Section 8.1.

3.2 Learning embodied representations‌

Computer vision has come a long way toward understanding images and videos in terms of scene geometry, object labels, locations and poses of people or classes of human actions. This “understanding”, however, remains largely disconnected from reasoning about the physical world. For example, what will happen when removing a tablecloth from a set table? What actions will be needed to resume an interrupted meal? We believe that a true embodied understanding of dynamic scenes from visual observations is the next major research challenge. We address this challenge by developing new models and algorithms with an emphasis on the synergy between vision, learning, robotics and natural language understanding. To this end, we study learning methods for motion planning and optimal control for known environments in state space. At the same time, we develop models and algorithms for learning visuomotor policies that do not rely on the known structure of environments and instead integrate visual perception directly into control algorithms. We also study natural language as an additional modality for more efficient learning and communication with embodied agents.

Our‌ current efforts in this area are outlined in‌ detail in Section 8.2.

3.3 Image restoration‌ and enhancement

Although image processing is a mature field, it is more important than ever with the advent of high-quality camera phones, scientific applications in microscopy and astronomy and, recently, the emergence of multi-modal sensing for autonomous cars, for example. In addition, it is an excellent proving ground for learning-based techniques since (a) it is in general (relatively) easy to generate realistic corrupted images from clean ones, since reasonable models of the physical image corruption problem are often available (Abdelhamed et al., 2019; Nah et al., 2017), and (b) it is possible to incorporate natural image priors such as self-similarities (Buades et al., 2005) and sparsity (Mairal et al., 2009) in the modelling and optimization processes. We have conducted work on image restoration since the time of Julien Mairal's PhD thesis, addressing problems such as demosaicking, denoising, inpainting, and inverse half-toning with a combination of sparse coding/dictionary learning methods and non-local means, then moving on to blind deblurring including motion segmentation and, more recently, deep-learning methods. In our ongoing efforts we address several challenges for learning-based approaches to image restoration: (i) how to combine different modalities such as depth and RGB images to improve the quality of the joint observations; (ii) how to construct tunable, fully interpretable approaches to image restoration in a functional framework; and (iii) how to incorporate machine learning methods that go beyond the traditional fully supervised setting into the image restoration pipeline.

Our current work in‌ this area is outlined‌‌ in detail in Section 8.4.

4 Application‌ domains

We believe that‌ foundational modeling work should‌‌ be grounded in applications. This includes (but is‌ not restricted to) the‌ following high-impact domains.

4.1‌‌ Automated visual assistants

Modern seamless video communication has enabled new applications in education, medicine and manufacturing, such as remote surgery or remotely supervised product assembly. The abundance of online instructional videos further confirms the high demand for assistance, including with daily tasks such as cooking and gardening. Our work on embodied video understanding and on the joint modeling of vision and language will support automatic visual assistants. Similar to existing driving navigation assistants, such applications will guide people in daily living, inspection and manufacturing tasks. Some of these applications are studied within our MSR-Inria collaboration.

4.2 Robotics‌‌

In 2023, the Willow team pursued the development of the Pinocchio library from both a scientific and a software perspective. Recent versions of Pinocchio now account for closed-loop mechanisms (based on proximal optimization), source code generation on GPUs, etc. All these new features make Pinocchio a unique tool to efficiently control complex robotic systems such as legged robots or industrial robots. We are now closely collaborating with PAL Robotics, which plans to use Pinocchio to control its next generation of humanoid robots, called Kangaroo. In the near future, the plan is to extend Pinocchio into a general-purpose and efficient robotic simulator that handles both rigid and compliant contact interactions between a robot and its environment, with the ambition of making Pinocchio the next gold-standard framework for simulation in robotics, offering advanced features for optimal control and reinforcement learning, such as differentiable simulation. Such features should position Pinocchio as the central simulator in robotics.

4.3 Image restoration‌

We are pursuing applications of our image restoration work to personal photography, to enhance the images taken by consumer cameras and smartphones by deblurring and denoising them, and improving their spatial resolution and dynamic range. In this context, we are collaborating with DXOMark, the world leader in smartphone camera evaluation, through a CIFRE thesis. Two of the objectives are to develop a public database of portraits fully compliant with European GDPR regulations, with informed consent from the models, and to automate the rating of image quality using this dataset. We also apply the combination of physical image formation models and machine learning principles that has made our image restoration work successful to scientific fields: we collaborate with Anne-Marie Lagrange (Observatoire de Paris), Maud Langlois (CNRS/Observatoire de Lyon) and Julien Mairal (Inria) on direct exoplanet detection from ground-based telescope imagery. This work also involves a post-doc, Olivier Flasseur, and a PhD student, Théo Bodrito. Next year, we will apply the same principles to molecular microscopy, in collaboration with Jean-Baptiste Masson (Institut Pasteur).

5 Social and environmental responsibility‌

Artificial intelligence holds great potential for improving our environment, for example, by reducing energy consumption and optimizing energy production. Computer vision, in particular, can be used to monitor emissions from coal plants and to track forest growth using satellite imagery. Autonomous drones can monitor and prevent failures of pipelines, power lines, power plants and other remote installations. However, as larger and more powerful AI models require increased compute power at training and deployment, AI itself accounts for an increasingly large carbon footprint. One direction of our research aims to develop efficient and low-resource neural network models. To this end, we have previously proposed Cross-Covariance Image Transformers (El-Nouby et al. NeurIPS'21) that avoid quadratic complexity in terms of image size. We have also been working on the development of new optimization methods and associated software (Bambade et al. ICLR'24) to reduce the overall computational burden and energy impact of these methods when applied to industrial and practical scenarios. In light of these developments, with the help of the Inria Soft infrastructure, we are considering creating a new software consortium, named Maestro, to accelerate the development and the dissemination of efficient algorithmic solutions for the control of robotic systems. One objective of this consortium is to provide software solutions that reduce the computational burden and energy consumption of modern robots currently deployed in industry or in societal sectors.

6 Highlights of the year‌

6.1 Awards

Cordelia Schmid has received the Archimedes‌ Science Award, Dresden, 2025.
Cordelia Schmid has received‌ the Hans Fischer Senior Fellowship, TUM, 2025.
Cordelia‌ Schmid has received the ACM Athena Lecturer Award,‌ 2025.
Cordelia Schmid has been named a Member of the National Academy of Artificial Intelligence (NAAI), 2025.
Justin Carpentier and Cordelia Schmid have received the‌ Prix de thèse du GdR Robotique 2025 for‌ Q. Le Lidec.
Ajay Sathya has received the IEEE Robotics and Automation Letters Best Paper Award.
Quentin Le Lidec has been awarded the‌ Best PhD Thesis Award in robotics by the‌ French national robotics network.
Antoine Bambade has received‌ the Prix Paul Caseau, awarded by the‌ French Academy of Technologies and EDF.
Stéphane Caron‌ has been awarded a PIQ grant, entitled‌ OSS4EAI.
Jean Ponce, together with Julien Mairal, has been awarded an i-Lab award for their startup Enhance Lab.

7‌ Latest software developments, platforms,‌ open data

7.1 Latest‌‌ software developments

7.1.1 alignsdf

Keywords:
Computer vision, 3D‌ reconstruction
Functional Description:

This‌ is the PyTorch implementation‌‌ of the AlignSDF research paper:

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction. Zerui Chen, Yana Hasson, Ivan Laptev, Cordelia Schmid. ECCV 2022.
Publication:‌
hal-03761124
Contact:
Zerui Chen‌
Participant:
4 anonymous participants‌‌

7.1.2 BLERC

Name:
Benchmarking Learning Efficiency in Deep‌ Reservoir Computing
Keywords:
Machine‌ learning, Continual Learning
Functional‌‌ Description:
Measuring learning efficiency of machine learning models.‌
URL:
https://github.com/hugcis/benchmark_learning_efficiency
Publication:
hal-03790477‌
Contact:
Hugo Cisneros

7.1.3‌‌ BurstSR

Name:
Super-resolution from image bursts
Keyword:
Image‌ processing
Functional Description:
This is a research prototype that takes as input a sequence of raw or RGB images produced by a smartphone or digital camera and produces a high-quality color image with higher resolution.
Release Contributions:‌‌
This new version, v0.2, introduces various improvements, as‌ well as C++ code‌ that accelerates the original‌‌ Python code.
Publication:
hal-03323885
Contact:
Julien Mairal
Participant:‌
3 anonymous participants

7.1.4‌ FrozenBiLM

Name:
Zero-Shot Video‌‌ Question Answering via Frozen Bidirectional Language Models
Keywords:‌
Computer vision, Natural language‌ processing, Deep learning
Functional‌‌ Description:
Code, datasets and models associated with the‌ paper "Zero-Shot Video Question‌ Answering via Frozen Bidirectional‌‌ Language Models"
URL:
https://github.com/antoyang/FrozenBiLM
Contact:
Antoine Yang

7.1.5‌ hiveformer

Keywords:
Robotics, NLP,‌ Transformer
Functional Description:

This‌‌ is the PyTorch implementation of the Hiveformer research‌ paper:

Instruction-driven history-aware policies for robotic manipulations. Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid. CoRL 2022 (oral).
Publication:
guhur:hal-03775734‌‌
Contact:
Pierre-Louis Guhur
Participant:
6 anonymous participants

7.1.6‌ HM3DAutoVLN

Name:
Learning from‌ Unlabeled 3D Environments for‌‌ Vision-and-Language Navigation
Keyword:
Computer vision
Functional Description:
Open‌ source release of the‌ software package for the‌‌ ECCV'22 paper by Chen et al. "Learning from‌ Unlabeled 3D Environments for‌ Vision-and-Language Navigation". This release‌‌ provides a full implementation of the method, including‌ code for training models,‌ and testing on standard‌‌ datasets, generated datasets as well as trained models.‌
URL:
https://github.com/cshizhe/HM3DAutoVLN
Publication:
hal-03890196‌
Contact:
Shizhe Chen
Participant:‌‌
5 anonymous participants

7.1.7 Just Ask: Learning to‌ Answer Questions from Millions‌ of Narrated Videos

Keywords:‌‌
Computer vision, Natural language processing, Deep learning
Functional‌ Description:
Code, datasets and‌ models associated with the‌‌ paper "Just Ask: Learning to Answer Questions from‌ Millions of Narrated Videos"‌
URL:
https://github.com/antoyang/just-ask
Contact:
Antoine‌‌ Yang

7.1.8 Pinocchio

Name:
Pinocchio
Keywords:
Robotics, Biomechanics,‌ Mechanical multi-body systems
Functional‌ Description:
Pinocchio instantiates state-of-the-art rigid-body algorithms for poly-articulated systems, based on revisited versions of Roy Featherstone's algorithms. In addition, Pinocchio provides analytical derivatives of the main rigid-body algorithms, such as the Recursive Newton-Euler Algorithm and the Articulated-Body Algorithm. Pinocchio was first tailored to legged robotics applications, but it can be used in other contexts. It is built upon Eigen for linear algebra and FCL for collision detection. Pinocchio comes with a Python interface for fast code prototyping.
URL:‌
https://github.com/stack-of-tasks/pinocchio
Contact:
Justin Carpentier‌
Partner:
CNRS
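
As an illustration only, the two-pass structure of the Recursive Newton-Euler Algorithm mentioned above can be sketched in plain Python for a toy planar chain. This is not Pinocchio's API or implementation: the function name planar_rnea, the point-mass link model and the gravity convention are assumptions made for this sketch.

```python
import math


def planar_rnea(lengths, masses, q, qd, qdd, gravity=9.81):
    """Joint torques of a planar serial chain with revolute joints.

    Toy model (an assumption of this sketch): link i is massless, carries a
    point mass masses[i] at its tip, and joint i+1 sits at that tip.
    Gravity acts along -y.
    """
    n = len(lengths)

    # Forward pass: propagate angles, rates and tip accelerations from the base.
    theta = omega = alpha = 0.0
    acc = (0.0, 0.0)
    link_vecs, tip_accs = [], []
    for i in range(n):
        theta += q[i]
        omega += qd[i]
        alpha += qdd[i]
        r = (lengths[i] * math.cos(theta), lengths[i] * math.sin(theta))
        # Tangential term alpha * (-r_y, r_x) plus centripetal term -omega^2 * r.
        acc = (acc[0] - alpha * r[1] - omega ** 2 * r[0],
               acc[1] + alpha * r[0] - omega ** 2 * r[1])
        link_vecs.append(r)
        tip_accs.append(acc)

    # Backward pass: accumulate forces and torques from the last link to the base.
    tau = [0.0] * n
    force = (0.0, 0.0)
    torque = 0.0
    for i in reversed(range(n)):
        # Force transmitted through joint i: inertial force minus gravity force.
        force = (force[0] + masses[i] * tip_accs[i][0],
                 force[1] + masses[i] * (tip_accs[i][1] + gravity))
        r = link_vecs[i]
        torque += r[0] * force[1] - r[1] * force[0]  # 2D cross product r x F
        tau[i] = torque
    return tau


# Holding a 1 kg mass at the end of a horizontal 1 m link against gravity.
holding_torque = planar_rnea([1.0], [1.0], [0.0], [0.0], [0.0])
```

In this toy setting, a chain hanging straight down requires no holding torque, and a single link with gravity disabled reduces to torque = m l² times the joint acceleration, which gives two quick sanity checks.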

7.1.9 ProxSuite‌‌

Name:
ProxSuite
Keywords:
Conic‌ optimization, Linear optimization, Robotics
Functional Description:

ProxSuite is a collection of open-source, numerically robust, precise and efficient numerical solvers (e.g., for LPs and QPs) rooted in revisited primal-dual proximal algorithms. Through ProxSuite, we aim to offer the community scalable optimizers that can deal with dense, sparse or matrix-free problems. While the first targeted application is robotics, ProxSuite can be used in other contexts without limits.

ProxSuite is actively developed and supported by the Willow and Sierra research groups, joint research teams between Inria, École Normale Supérieure de Paris and Centre National de la Recherche Scientifique, located in France.
Contact:
Justin Carpentier
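
To give a flavour of what a proximal primal-dual iteration looks like, the following minimal sketch solves a two-variable QP with one equality constraint using a proximal augmented Lagrangian scheme. It is a textbook-style toy and not ProxSuite's API or its actual algorithms: the function names, the penalty parameters mu and rho, and the restriction to two variables are all assumptions of this sketch.

```python
def solve_2x2(m, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def proximal_al_qp(H, g, a, b, mu=10.0, rho=1e-3, iters=50):
    """Minimize 0.5 x^T H x + g^T x subject to a^T x = b (two variables).

    Each iteration minimizes the augmented Lagrangian plus a proximal term
    (rho / 2) * ||x - x_k||^2 in closed form, then updates the multiplier y.
    """
    x = [0.0, 0.0]
    y = 0.0
    for _ in range(iters):
        # Primal step: (H + mu a a^T + rho I) x = -g - y a + mu b a + rho x_k
        M = [[H[i][j] + mu * a[i] * a[j] + (rho if i == j else 0.0)
              for j in range(2)] for i in range(2)]
        rhs = [-g[i] - y * a[i] + mu * b * a[i] + rho * x[i] for i in range(2)]
        x = solve_2x2(M, rhs)
        # Dual step: gradient ascent on the multiplier of the constraint.
        y += mu * (a[0] * x[0] + a[1] * x[1] - b)
    return x, y


# Closest point to the origin on the line x1 + x2 = 1.
x_opt, y_opt = proximal_al_qp([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0], 1.0)
```

For this example the iterates approach x = (0.5, 0.5) with multiplier y = -0.5. The proximal term keeps the linear system well conditioned even when H is only positive semidefinite, which is one reason such schemes are described as numerically robust.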

7.1.10 SPE

Name:
Semantics‌ Preserving Encoder
Keywords:
NLP, Adversarial attack, Word embeddings‌
Functional Description:
Semantics Preserving Encoder is a simple,‌ fully supervised sentence embedding technique for textual adversarial‌ attacks.
URL:
https://github.com/DavidHerel/semantics-preserving-encoder
Contact:
Hugo Cisneros
Participant:
3‌ anonymous participants

7.1.11 TubeDETR

Name:
TubeDETR: Spatio-Temporal Video‌ Grounding with Transformers
Keywords:
Computer vision, Natural language‌ processing, Deep learning
Functional Description:
Code, datasets and‌ models associated with the paper "TubeDETR: Spatio-Temporal Video‌ Grounding with Transformers"
URL:
https://github.com/antoyang/TubeDETR
Contact:
Antoine Yang‌

7.1.12 vil3dref

Name:
Language Conditioned Spatial Relation Reasoning‌ for 3D Object Grounding
Keyword:
Computer vision
Functional‌ Description:
Open source release of the software package‌ for the NeurIPS'22 paper by Chen et al.‌ "Language Conditioned Spatial Relation Reasoning for 3D Object‌ Grounding". This release provides a full implementation of‌ the method, including code for training models, and‌ testing on standard datasets, as well as trained‌ models.
URL:
https://github.com/cshizhe/vil3dref
Publication:
hal-03890174
Contact:
Shizhe Chen‌
Participant:
5 anonymous participants

7.1.13 VLN-DUET

Name:
Think‌ Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language‌ Navigation
Keyword:
Computer vision
Functional Description:
Open source release of the software package for the CVPR'22 paper by Chen et al. "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation". This release provides a full implementation of the method, including code for training models, and testing on standard datasets, as well as trained models.
URL:‌
https://github.com/cshizhe/VLN-DUET
Publication:
hal-03696868
Contact:
Shizhe Chen
Participant:
5‌ anonymous participants

Participants: Jean Ponce, Justin Carpentier‌, Cordelia Schmid, Ivan Laptev, Etienne‌ Arlaud, Pierre-Guillaume Raverdy, Stephane Caron,‌ Shizhe Chen.

7.2 New platforms

Together with SED, we are building the new robotics laboratory at Inria Paris, located on the 1st floor of the A building. The lab hosts a diverse set of robotic platforms covering dexterous manipulation, legged locomotion, and mobile robotics. The current equipment includes three UR5 robotic arms, an Allegro Hand, a Shadow Hand, and a TIAGo++ robot integrating both a mobile base and a manipulator. For legged and mobile experiments, the lab includes the Upkie biped, the Unitree GO2 quadruped, and the ODRI Solo-12 quadruped. In 2025, the laboratory expanded its fleet with the acquisition of two Unitree G1 humanoid robots. The robotics laboratory is also equipped with a dedicated motion capture system for precise object localization and robot calibration. These robotic platforms will enable our future research and experiments with locomotion, navigation and manipulation.

7.3 Open data

8 New results

8.1 Visual‌ recognition and reconstruction of‌ images and videos

8.1.1‌‌ MaskCaptioner: Learning to Jointly Segment and Caption Object‌ Trajectories in Videos

Participants:‌ Gabriel Fiastre, Antoine‌‌ Yang, Cordelia Schmid.

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose in this work 44 to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are publicly available.

Figure 1: Example‌ of synthetic captions in‌‌ our LV-VISCap dataset.

8.1.2 ComposeAnything: Composite Object Priors‌ for Text-to-Image Generation

Participants:‌ Zeeshan Khan, Shizhe‌‌ Chen, Cordelia Schmid.

This paper 47 addresses the problem of compositional text-to-image generation. Current text-to-image models struggle to generate scenes with many objects and complex relations. Training-time solutions such as layout conditioning or reinforcement learning improve compositional accuracy but often degrade image quality and realism by enforcing rigid constraints. To address this limitation, we introduce ComposeAnything, an inference-only framework that injects a structured composite object prior directly into the diffusion process. Rather than starting from random latent noise or performing expensive noise optimization, we construct a single 2.5D composite prior encoding strong object appearance, counts, sizes, and coarse depth-aware placement, and use it to initialize and guide one diffusion trajectory. This explicit prior is interpretable and editable in image space, enabling human-in-the-loop refinement by simply adjusting the composite. Our training-free, backbone-agnostic method improves compositional consistency on T2I-CompBench and NSR-1K benchmarks, particularly for complex prompts, while maintaining high visual quality compared to both training-based baselines and other inference-time methods.

Figure 2: ComposeAnything enables text-to-image generation for complex compositions involving surreal spatial relationships and high object counts, achieving both high visual quality and strong faithfulness to text.

8.1.3 FACap: A‌ Large-Scale Fashion Dataset for‌ Fine-grained Composed Image Retrieval‌‌

Participants: François Gardères, Camille-Sovanneary Gauthier, Shizhe‌ Chen, Jean Ponce‌.

The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the high cost of manual annotation by specialists. To address these challenges, we introduce in this paper 33 FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model, FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model's performance in fashion CIR, especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites.

Figure 3: Our automatically constructed FACap‌ dataset offers more detailed and accurate annotations than‌ existing datasets for the fashion CIR task.

8.1.4‌ Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs‌

Participants: Lucas Ventura, Antoine Yang, Cordelia‌ Schmid, Gül Varol.

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper 31, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate its advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.

8.1.5 Online 3D Scene‌ Reconstruction Using Neural Object Priors

Participants: Thomas Chabal, Shizhe Chen,‌ Jean Ponce, Cordelia‌ Schmid.

This paper‌‌ 23 addresses the problem of reconstructing a scene‌ online at the level‌ of objects given an‌‌ RGB-D video sequence. While current object-aware neural implicit‌ representations hold promise, they‌ are limited in online‌‌ reconstruction efficiency and shape completion. Our main contributions‌ to alleviate the above‌ limitations are twofold. First,‌‌ we propose a feature grid interpolation mechanism to‌ continuously update grid-based object-centric‌ neural implicit representations as‌‌ new object parts are revealed. Second, we construct‌ an object library with‌ previously mapped objects in‌‌ advance and leverage the corresponding shape priors to‌ initialize geometric object models‌ in new videos, subsequently‌‌ completing them with novel views as well as‌ synthesized past views to‌ avoid losing original object‌‌ details. Extensive experiments on synthetic environments from the‌ Replica dataset, real-world ScanNet‌ sequences and videos captured‌‌ in our laboratory demonstrate that our approach outperforms‌ state-of-the-art neural implicit models‌ for this task in‌‌ terms of reconstruction accuracy and completeness.

Figure 4‌: Our method reconstructs‌‌ scenes at the level of objects from RGB-D‌ videos on the fly.‌ We leverage 3D shape‌‌ priors from a pre-computed object library to enhance‌ accuracy and completeness of‌ geometry reconstruction for individual‌‌ objects.

8.1.6 Detecting Looted Archaeological Sites from Satellite‌ Image Time Series

Participants:‌ Elliot Vincent, Mehraïl‌‌ Saroufim, Jonathan Chemla, Yves Ubelmann,‌ Philippe Marquis, Jean‌ Ponce, Mathieu Aubry‌‌.

Archaeological sites are the physical remains of‌ past human activity and‌ one of the main‌‌ sources of information about past societies and cultures.‌ However, they are also‌ the target of malevolent‌‌ human actions, especially in countries having experienced inner‌ turmoil and conflicts. Because‌ monitoring these sites from‌‌ space is a key step towards their preservation,‌ we introduce the DAFA‌ Looted Sites dataset 32‌‌, a labeled multi-temporal remote sensing dataset containing‌ 55,480 images acquired monthly‌ over 8 years across‌‌ 675 Afghan archaeological sites, including 135 sites looted‌ during the acquisition period.‌ It is particularly challenging‌‌ because of the limited number of training samples,‌ the class imbalance, the‌ weak binary annotations only‌‌ available at the level of the time series,‌ and the subtlety of‌ relevant changes coupled with‌‌ important irrelevant ones over a long time period.‌ It is also an‌ interesting playground to assess‌‌ the performance of satellite image time series (SITS)‌ classification methods on a‌ real and important use‌‌ case. We evaluate a large set of baselines,‌ outline the substantial benefits‌ of using foundation models‌‌ and show the additional boost that can be‌ provided by using complete‌ time series instead of‌‌ using a single image.

8.1.7 Towards Zero-Shot Multimodal‌ Machine Translation

Participants: Matthieu‌ Futeral, Cordelia Schmid‌‌, Benoît Sagot,‌ Rachel Bawden.

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work 24, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
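To make the inference-time trade-off concrete, here is a minimal sketch of the generic classifier-free guidance rule applied to next-token logits. It is an illustration under stated assumptions, not the ZeroMMT implementation: the function name `guided_logits`, the choice of contrasting a text-only pass with an image-conditioned pass, and the toy logit values are all hypothetical.

```python
def guided_logits(text_only, image_conditioned, scale):
    """Generic classifier-free guidance on two logit vectors.

    scale = 0 keeps the text-only prediction, scale = 1 keeps the
    image-conditioned prediction, and scale > 1 extrapolates further
    in the direction suggested by the image.
    """
    return [t + scale * (c - t) for t, c in zip(text_only, image_conditioned)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits over three candidate tokens for an ambiguous word.
text_only = [2.0, 1.9, 0.1]
image_conditioned = [1.8, 2.2, 0.1]

for scale in (0.0, 1.0, 2.0):
    logits = guided_logits(text_only, image_conditioned, scale)
    print(scale, argmax(logits))
```

A larger scale gives the image more say in the choice of token, which is one way to read the trade-off between disambiguation and translation fidelity described above.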

8.1.8 mOSCAR: A‌ Large-scale Multilingual and Multimodal Document-level Corpus

Participants: Matthieu‌ Futeral, Armel Zebaze, Pedro Ortiz Suarez‌, Julien Abadji, Rémi Lacroix, Cordelia‌ Schmid, Rachel Bawden, Benoît Sagot.‌

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like data only, medium-scale, or fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR 25, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.

Figure 5‌‌: Example of a French document from mOSCAR.‌

8.1.9 Dual Perspectives on‌ Non-Contrastive Self-Supervised Learning

Participants:‌‌ Jean Ponce, Basile Terver, Martial Hebert‌, Michael Arbel.‌

The stop gradient and‌‌ exponential moving average iterative procedures are commonly used‌ in non-contrastive approaches to‌ self-supervised learning to avoid‌‌ representation collapse, with excellent performance in downstream applications‌ in practice. This presentation‌ investigates these procedures from‌‌ the dual viewpoints of optimization and dynamical systems.‌ We show that, in‌ general, although they do‌‌ not optimize the original objective, or any other‌ smooth function, they do‌ avoid collapse. Following Tian‌‌ et al. 2021, but without any of the‌ extra assumptions used in‌ their proofs, we then‌‌ show using a dynamical system perspective that, in‌ the linear case, minimizing‌ the original objective function‌‌ without the use of a stop gradient or‌ exponential moving average always‌ leads to collapse. Conversely,‌‌ we characterize explicitly the equilibria of the dynamical‌ systems associated with these‌ two procedures in this‌‌ linear setting as algebraic varieties in their parameter‌ space, and show that‌ they are, in general,‌‌ asymptotically stable. Our theoretical findings 53 are‌ illustrated by empirical experiments‌ with real and synthetic‌‌ data.
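For readers unfamiliar with the procedures studied here, the exponential moving average update can be sketched as follows. This is the generic rule used in non-contrastive methods, not code from the paper; the function name `ema_update`, the decay value and the toy parameter vectors are illustrative assumptions.

```python
def ema_update(target, online, decay):
    """Move the target parameters a small step toward the online ones.

    The target branch receives no gradient of its own (this is where the
    stop gradient comes in): it only tracks the online branch through
    this averaging step.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


# Toy parameters: the online network is held fixed to isolate the averaging.
online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(50):
    target = ema_update(target, online, decay=0.9)
print(target)  # close to the online parameters after enough steps
```

With the online parameters held fixed, the gap between the two branches shrinks by the factor `decay` at each step, which is why the target behaves as a slowly moving copy of the online network.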

8.1.10 Optimal transport unlocks end-to-end learning for‌ single-molecule localization

Participants: Romain‌ Seailles, Jean-Baptiste Masson‌‌, Jean Ponce, Julien Mairal.

Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation 36, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available here.

8.2‌ Learning embodied representations

8.2.1‌‌ NextBestPath: Efficient 3D Mapping of Unseen Environments

Participants:‌ Shiyao Li, Antoine‌ Guédon, Clémentin Boittiaux‌‌, Shizhe Chen, Vincent Lepetit.

This paper 27 addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset, AiMDoom, with a map generator for the Doom video game, which enables better benchmarking of active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.

Figure 6: Overview of the proposed next-best-path‌ (NBP) framework. The model (left) predicts a value‌ map of coverage gain and an obstacle map,‌ which are used for decision making (right) to‌ obtain a next-best path.

8.2.2 FOM-Nav: Frontier-Object Maps‌ for Object Goal Navigation

Participants: Thomas Chabal,‌ Shizhe Chen, Jean Ponce, Cordelia Schmid‌.

This paper 42 addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation efficiency metric SPL, and yields promising results on a real robot.
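The SPL metric mentioned above (Success weighted by Path Length) has a standard definition: each successful episode contributes the ratio of the shortest-path length to the length of the path the agent actually took, failed episodes contribute zero, and the result is averaged over episodes. The sketch below follows that definition; the function name `spl` and the episode values are hypothetical and unrelated to the reported results.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_length, agent_length) tuples,
    with strictly positive lengths (an assumption of this sketch).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # The max guards against paths reported shorter than the optimum.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one success with a detour
# twice as long as needed, and one failure.
episodes = [(True, 10.0, 10.0), (True, 5.0, 10.0), (False, 4.0, 12.0)]
print(spl(episodes))
```

Because failures score zero and detours are penalized proportionally, SPL rewards agents that both reach the target and do so along near-shortest paths.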

Figure 7: The‌ proposed frontier-object map is a rich representation of‌ objects and frontiers (boundaries of the explored scene),‌ displayed here as colored point clouds and red‌ lines. It encodes geometric, distance and visual/textual information‌ for frontiers and objects.

8.2.3 Gondola: Grounded Vision‌ Language Planning for Generalizable Robotic Manipulation

Participants: Shizhe‌ Chen, Ricardo Garcia, Paul Pacaud,‌ Cordelia Schmid.

Robotic manipulation faces a significant‌ challenge in generalizing across unseen objects, environments and‌ tasks specified by diverse language instructions. To improve‌ generalization capabilities, recent research has incorporated large language models (LLMs) for planning‌ and action execution. While‌ promising, these methods often‌‌ fall short in generating grounded plans in visual‌ environments. Although efforts have‌ been made to perform‌‌ visual instructional tuning on LLMs for robotic manipulation,‌ existing methods are typically‌ constrained by single-view image‌‌ input and struggle with precise object grounding. In‌ this work 43,‌ we introduce Gondola, a‌‌ novel grounded vision-language planning model based on LLMs‌ for generalizable robotic manipulation.‌ Gondola takes multi-view images‌‌ and history plans to produce the next action‌ plan with interleaved texts‌ and segmentation masks of‌‌ target objects and locations. To support the training‌ of Gondola, we construct‌ three types of datasets‌‌ using the RLBench simulator, namely robot grounded planning,‌ multi-view referring expression and‌ pseudo long-horizon task datasets.‌‌ Gondola outperforms the state-of-the-art LLM-based method across all‌ four generalization levels of‌ the GemBench dataset, including‌‌ novel placements, rigid objects, articulated objects and long-horizon‌ tasks.

Figure 8:‌ Gondola leverages multi-view images‌‌ for 3D scene perception and segmentation masks to‌ provide precisely grounded plans.‌

8.2.4 Collision avoidance from‌‌ monocular vision trained with novel view synthesis

Participants:‌ Valentin Tordjman--Levavasseur, Stéphane‌ Caron.

Collision avoidance can be checked in explicit environment models such as elevation maps or occupancy grids, yet integrating such models with a locomotion policy requires accurate state estimation. In 58, we consider the question of collision avoidance from an implicit environment model. We use monocular RGB images as inputs and train a collision-avoidance policy from photorealistic images generated by 2D Gaussian splatting. We evaluate the resulting pipeline in real-world experiments under velocity commands that put the robot on an intercept course with obstacles. Our results suggest that RGB images can be enough to make collision-avoidance decisions, both in the room where training data was collected and in out-of-distribution environments.

Figure 9: Effect‌ of the vision-based collision-avoidance‌ policy when the commanded‌‌ velocity prompts the robot to collide with a‌ wall. Blue: joystick user‌ input, kept stationary at‌‌ full forward throttle. Green: trajectory actually followed by‌ the robot after compensation‌ by the policy.

8.2.5‌‌ KernelSOS for Global Sampling-Based Optimal Control and Estimation‌ via Semidefinite Programming

Participants:‌ Antoine Groudiev, Fabian‌‌ Schramm, Eloïse Berthier, Justin Carpentier,‌ Frederike Dümbgen.

Global optimization has gained traction over the past decades, thanks to the development of both theoretical foundations and efficient numerical routines to cope with optimization problems of various complexities. Among recent methods, Kernel Sum of Squares (KernelSOS) has emerged as a powerful framework, combining the potential of sum of squares methods from the polynomial optimization community with the expressivity of kernel methods widely used in machine learning. This paper 46 applies the kernel sum of squares framework to control and estimation problems, which exhibit poor local minima. We demonstrate that KernelSOS performs well on a selection of problems from both domains. In particular, we show that KernelSOS is competitive with other sum of squares approaches on estimation problems, while being applicable to non-polynomial and non-parametric formulations. The sample-based nature of KernelSOS allows us to apply it to trajectory optimization problems with an integrated simulator treated as a black box, both as a standalone method and as a powerful initialization method for local solvers, facilitating the discovery of better solutions.

8.2.6 Sobolev Diffusion Policy

Participants: Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier.

This paper 48 introduces a novel framework to combine the strengths of policy learning and trajectory optimization effectively. On the one hand, it builds upon diffusion policy, an expressive imitation learning method based on diffusion probabilistic generative models. On the other hand, it uses gradient-based trajectory optimization solvers to generate locally optimal trajectories and leverages their associated feedback gains, performing Sobolev training with first-order information. Combining both, we introduce a first-order loss for diffusion-based policies. The framework alternates between collecting trajectories using a solver warm-started by the policy and training. Through comprehensive experiments, we demonstrate how the Sobolev component significantly reduces the number of trajectories required for the policy to converge globally. First-order information both avoids overfitting, despite using very few samples, and mitigates the compounding error issue of imitation-based policies, even when predicting torques for tasks requiring high-frequency control. We benchmark the benefits of Sobolev Diffusion Policy (SDP) on various robotics tasks of increasing complexity. In particular, SDP proves stable over extended horizons, with fewer diffusion steps, shrinking the overall rollout time compared to vanilla diffusion models. When used to compute initial guesses for trajectory optimization, it reduces the solving time by a factor of 2 to 20.

Figure 10: A task that involves moving the arm's end-effector from the blue sphere to the red one. The pink trajectory is obtained by a trajectory optimization (TO) solver alone, the orange one by our SDP method, and the gray one is the SDP trajectory refined by the solver. SDP finds more direct trajectories, while the TO solver may get stuck in local minima.

8.2.7 First-order Sobolev Reinforcement Learning

Participants:‌ Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier‌.

This paper 55 proposes a refinement of temporal-difference learning that enforces‌ first-order Bellman consistency: the‌ learned value function is‌‌ trained to match not only the Bellman targets‌ in value but also‌ their derivatives with respect‌‌ to states and actions. By differentiating the Bellman‌ backup through differentiable dynamics,‌ we obtain analytically consistent‌‌ gradient targets. Incorporating these into the critic objective‌ using a Sobolev-type loss‌ encourages the critic to‌‌ align with both the value and local geometry‌ of the target function.‌ This first-order TD matching‌‌ principle can be seamlessly integrated into existing algorithms,‌ such as Q-learning or‌ actor-critic methods (e.g., DDPG,‌‌ SAC), potentially leading to faster critic convergence and‌ more stable policy gradients‌ without altering their overall‌‌ structure.
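One plausible way to write the first-order matching principle described above is given below. The notation is assumed for illustration and is not taken from the paper: deterministic differentiable dynamics $s' = f(s, a)$, reward $r$, discount $\gamma$, policy $\pi$, a target critic $Q_{\bar\theta}$ treated as constant with respect to $\theta$, and a weight $\lambda$ balancing the value and gradient terms.

```latex
% Bellman target in value (assumed notation, see lead-in).
y(s, a) = r(s, a) + \gamma \, Q_{\bar\theta}\bigl(s', \pi(s')\bigr),
\qquad s' = f(s, a)

% Gradient target, obtained by differentiating the backup through the dynamics
% (chain rule; d/ds' denotes the total derivative, including the policy).
\nabla_{(s,a)} y
  = \nabla_{(s,a)} r(s, a)
  + \gamma \left( \frac{\partial f}{\partial (s, a)} \right)^{\!\top}
    \frac{\mathrm{d}}{\mathrm{d} s'} Q_{\bar\theta}\bigl(s', \pi(s')\bigr)

% Sobolev-type critic loss: match the target in value and in gradient.
\mathcal{L}(\theta)
  = \bigl( Q_\theta(s, a) - y \bigr)^2
  + \lambda \, \bigl\| \nabla_{(s,a)} Q_\theta(s, a) - \nabla_{(s,a)} y \bigr\|^2
```

Setting $\lambda = 0$ recovers the usual temporal-difference loss, which is consistent with the claim that the principle can be added to existing algorithms without altering their structure.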

Figure 11: Comparison of $Q$-function slices: ground truth (black), Sobolev Q-learning (blue), and default Q-learning (red, dashed) after 200 and 400 training steps for two different states, $s = 0.0$ and $s = 0.5$.

8.2.8‌‌ Control of Humanoid Robots with Parallel Mechanisms using‌ Differential Actuation Models

Participants:‌ Victor Lutz, Ludovic‌‌ de Matteis, Virgile Batto, Nicolas Mansard‌.

Several recently released‌ humanoid robots, inspired by‌‌ the mechanical design of Cassie, employ actuator configurations‌ in which the motors‌ are displaced from the‌‌ joints to reduce leg inertia. While studies accounting‌ for the full kinematic‌ complexity have demonstrated the‌‌ benefits of these designs, the associated loop-closure constraints‌ greatly increase computational cost‌ and limit their use‌‌ in control and learning. As a result, the‌ non-linear transmission is often‌ approximated by a constant‌‌ reduction ratio, preventing exploitation of the mechanism’s full‌ capabilities. This paper 50‌ introduces a compact analytical‌‌ formulation for the two standard knee and ankle‌ mechanisms that captures the‌ exact non-linear transmission while‌‌ remaining computationally efficient. The model is fully differentiable‌ up to second order‌ with a minimal formulation,‌‌ enabling low-cost evaluation of dynamic derivatives for trajectory‌ optimization and of the‌ apparent transmission impedance for‌‌ reinforcement learning. We integrate this formulation into trajectory‌ optimization and locomotion policy‌ learning, and compare it‌‌ against simplified constant-ratio approaches. Hardware experiments demonstrate improved‌ accuracy and robustness, showing‌ that the proposed method‌‌ provides a practical means to incorporate parallel actuation‌ into modern control algorithms.‌

8.2.9 On the Conic‌‌ Complementarity of Planar Contacts

Participants: Yann de Mont-Marin‌, Louis Montaut,‌ Jean Ponce, Martial‌‌ Hebert, Justin Carpentier.

We present a unifying theoretical result 30 that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept used in, for instance, optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the pointwise Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes (sticking, separating, and tilting) within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of advanced control and optimization algorithms in locomotion and manipulation.

8.2.10 Reference-Free Sampling-Based‌ Model Predictive Control

Participants: Fabian Schramm, Pierre‌ Fabre, Nicolas Perrin-Gilbert, Justin Carpentier.‌

This paper 35 presents a sampling-based model predictive‌ control (MPC) framework that enables emergent locomotion without‌ relying on handcrafted gait patterns or predefined contact‌ sequences. Our method discovers diverse motion patterns, ranging‌ from trotting to galloping, robust standing policies, jumping,‌ and handstand balancing, purely through the optimization of‌ high-level objectives. Building on model predictive path integral‌ (MPPI), we propose a dual-space spline parameterization that‌ operates on position and velocity control points. Our‌ approach enables contact-making and contact-breaking strategies that adapt‌ automatically to task requirements, requiring only a limited‌ number of sampled trajectories. This sample efficiency allows‌ us to achieve real-time control on standard CPU‌ hardware, eliminating the need for GPU acceleration typically‌ required by other state-of-the-art MPPI methods. We validate‌ our approach on the Go2 quadrupedal robot, demonstrating‌ various emergent gaits and basic jumping capabilities. In‌ simulation, we further showcase more complex behaviors, such‌ as backflips, dynamic handstand balancing and locomotion on‌ a Humanoid, all without requiring reference tracking or‌ offline pre-training.
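As background for the MPPI building block named above, the sketch below shows the textbook MPPI update: sampled perturbations of a nominal control sequence are averaged with weights that decay exponentially with trajectory cost. It is a generic illustration, not the paper's dual-space spline parameterization; the function names, the temperature value and the toy numbers are assumptions.

```python
import math


def mppi_weights(costs, temperature):
    """Softmin weights over sampled trajectories (lower cost, higher weight)."""
    best = min(costs)  # subtracting the best cost keeps the exponentials stable
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_update(nominal, perturbations, costs, temperature):
    """Shift the nominal controls by the cost-weighted average perturbation."""
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[k] for w, eps in zip(weights, perturbations))
        for k, u in enumerate(nominal)
    ]


# Toy example: three sampled perturbations of a two-step control sequence,
# with hypothetical rollout costs.
nominal = [0.0, 0.0]
perturbations = [[0.5, -0.1], [-0.3, 0.2], [0.1, 0.4]]
costs = [1.0, 4.0, 2.5]
print(mppi_update(nominal, perturbations, costs, temperature=1.0))
```

The fewer samples are needed to obtain a useful weighted average, the cheaper each control step becomes, which is the kind of sample efficiency the paragraph above relies on for real-time CPU control.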

Figure‌ 12: Overview of the framework showing the‌ dual-spline parametrization, noise schedule and reference-free costs.

8.2.11‌ Guided Flow Policy: Learning from High-Value Actions in‌ Offline Reinforcement Learning

Participants: Franki Nguimatsia Tiofack,‌ Théotime Le Hellard, Fabian Schramm, Nicolas‌ Perrin-Gilbert, Justin Carpentier.

Offline reinforcement learning‌ often relies on behavior regularization that enforces policies‌ to remain close to the dataset distribution. However,‌ such approaches fail to distinguish between high-value and‌ low-value actions in their regularization components. We introduce‌ Guided Flow Policy (GFP) 34, which couples‌ a multistep flow-matching policy with a distilled one-step‌ actor. The actor directs the flow policy through‌ weighted behavior cloning to focus on cloning high-value‌ actions from the dataset rather than indiscriminately imitating‌ all state-action pairs. In turn, the flow policy‌ constrains the actor to remain aligned with the‌ dataset's best transitions while maximizing the critic. This‌ mutual guidance enables GFP to achieve state-of-the-art performance‌ across 144 state and pixel-based tasks from the‌ OGBench, Minari, and D4RL benchmarks, with substantial gains‌ on suboptimal datasets and challenging tasks.

Figure 13: Overview of the Guided Flow Policy framework. GFP consists of three main components: (i) in yellow, VaBC, a multi-step flow policy $π_{ω}$ trained via weighted BC using the guidance term $g_{η}$, (ii) in green, a one-step actor $π_{θ}$ distilled from the flow policy, and (iii) in gray, a critic $Q_{ϕ}$ guiding action evaluation. $π_{ω}$ regularizes the actor toward high-value actions from the dataset; in turn, the actor shapes the flow and optimizes the critic following the actor–critic approach. The different components of the figure are introduced throughout the paper. Each drawing represents the probability distribution of actions $a \in 𝒜$ of a policy, in a current state $s$, except for the gray ones, where it is the value of actions $a \in 𝒜$ in state $s$, according to the critic.

8.2.12 Contact-Implicit‌‌ Inverse Dynamics

Participants: Etienne Ménager, Pierre Fabre, Antoine Bambade, Wilson Jallet, Alberto De Marchi, Justin Carpentier.

Task-space inverse‌ dynamics, also known as‌ operational space control, is‌‌ a popular control paradigm for controlling robots in‌ real-time. It enables the‌ control or stabilization of‌‌ robot dynamics around reference trajectories while accounting for‌ under-actuation, actuator limits, and‌ contact interactions. Over the‌‌ past few decades, this versatile control paradigm has‌ been successfully deployed in‌ numerous robotics settings, ranging‌‌ from quadrupeds and humanoid robots to deformable robots,‌ in scenarios involving rich‌ physical contact interactions between‌‌ a robot and its environment. In practice, contact-aware‌ inverse dynamics controllers assume‌ that contact sequences are‌‌ known in advance, typically provided by a higher-level‌ contact planner, which inherently‌ limits their ability to‌‌ select among breaking, sliding, or sticking contacts automatically.‌

In this paper 51, we extend the control formalism of task-space inverse dynamics, which is classically formulated as a quadratic program, to a more general quadratic program with complementarity constraints (QPCC). This formulation fully accounts for actuator limits and frictional contacts, modeled as nonlinear complementarity constraints. To solve these QPCC problems, we draw inspiration from the alternating direction method of multipliers to devise an iterative optimization approach that alternates between minimizing a smooth convex function that accounts for task objectives and system dynamics, and projecting onto convex and non-convex sets that capture actuator and frictional contact complementarity constraints. Notably, by handling frictional contact complementarity constraints through projection, our approach enables us to implicitly and automatically reason about the optimal contact modes that fulfill the task objectives and constraints. We have implemented our QPCC solver in C++ for efficiency, and demonstrate its usability and versatility on rigid and soft robots across various control scenarios, ranging from the control of an actuated box sliding on the ground, to the balance control of legged robots that automatically break and create contacts (e.g., jumping tasks, balancing tasks) and the control of deformable robots interacting with their environment.
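To illustrate what projecting onto a non-convex complementarity set can look like, the sketch below treats the simplest scalar, frictionless case: a pair (gap, force) must satisfy gap >= 0, force >= 0 and gap * force = 0. This is a deliberately simplified stand-in for the frictional constraints handled in the paper; the function name `project_complementarity` and the variable interpretation are assumptions.

```python
def project_complementarity(gap, force):
    """Euclidean projection of (gap, force) onto {g >= 0, f >= 0, g * f = 0}.

    The feasible set is the union of two rays (the non-negative gap axis and
    the non-negative force axis), so the projection is the closer of the two
    per-ray projections. The choice of ray is the choice of contact mode:
    separated (force = 0) or in contact (gap = 0).
    """
    separated = (max(gap, 0.0), 0.0)
    in_contact = (0.0, max(force, 0.0))

    def squared_distance(point):
        return (point[0] - gap) ** 2 + (point[1] - force) ** 2

    return min((separated, in_contact), key=squared_distance)


print(project_complementarity(2.0, 1.0))    # separated mode is closer
print(project_complementarity(-1.0, 3.0))   # contact mode is closer
```

The projection itself decides which mode is active, which mirrors, in a much simpler setting, the idea of reasoning implicitly about contact modes instead of prescribing them in advance.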

Figure 14: The proposed method generically handles inverse dynamics with frictional contact for both rigid and soft robots. Standing humanoid (left). The humanoid tracks an unstable reference configuration (standing on its left leg). The only cost terms are the reference tracking cost and regulating the angular momentum around zero. Deformable robot (right). The robot controls a ball (in blue) by deforming its body. The task is formulated as minimizing the distance between a reference position and the ball's center, and solved using the robot-ball coupling via frictional contacts.

8.3 A Data-driven Contact‌ Estimation Method for Wheeled-Biped Robots

Participants: Ü. Bora‌ Gökbakan, Frederike Dümbgen, Stéphane Caron.‌

Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter 45 in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.
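The predict/update structure of a Bayes filter over a binary contact state can be sketched as follows. This is the textbook discrete filter, not the estimator from the paper: the transition probabilities and measurement likelihoods below are hypothetical constants, whereas the paper derives its prediction step from inertial measurements and learns its update step from torque measurements.

```python
def predict(p_contact, p_stay_contact, p_make_contact):
    """Prediction step: propagate the contact belief through a transition model."""
    return p_contact * p_stay_contact + (1.0 - p_contact) * p_make_contact


def update(p_contact, lik_contact, lik_no_contact):
    """Update step: reweight the belief by the measurement likelihoods (Bayes rule)."""
    numerator = lik_contact * p_contact
    return numerator / (numerator + lik_no_contact * (1.0 - p_contact))


# Hypothetical run: start undecided, then receive three measurements that are
# four times more likely under contact than under no contact.
belief = 0.5
for _ in range(3):
    belief = predict(belief, p_stay_contact=0.9, p_make_contact=0.1)
    belief = update(belief, lik_contact=0.8, lik_no_contact=0.2)
    print(belief)
```

Thresholding the belief, for example at 0.5, would turn it into a binary contact decision; the threshold is a design choice left open in this sketch.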

Figure 15:‌ Robustly detecting the moments when a wheeled-biped robot‌ makes and breaks contact is crucial for successful‌ estimation and control. This paper proposes a contact‌ estimator based only on inertial and torque measurements.‌ The measurements are fed into a novel Bayesian‌ filter formulation to robustly estimate the binary contact‌ state. We validate our results extensively both in‌ simulation and real-world experiments, as depicted in the‌ bottom figure.
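The predict/update structure of a Bayes filter over a binary contact state can be sketched as follows. This is a minimal illustration only: the transition probabilities (`p_make`, `p_break`) and the measurement likelihoods are hypothetical fixed inputs here, whereas in the paper the prediction relies on inertial measurements and the update is learned from torque measurements.

```python
def predict(p_contact, p_make, p_break):
    """Prediction step: propagate P(contact) through a two-state Markov model.
    p_make  = P(contact at t | no contact at t-1)  (assumed input)
    p_break = P(no contact at t | contact at t-1)  (assumed input)"""
    return p_contact * (1.0 - p_break) + (1.0 - p_contact) * p_make


def update(p_contact, lik_contact, lik_no_contact):
    """Update step: Bayes rule with the likelihood of the current measurement
    under each hypothesis (placeholders for a learned measurement model)."""
    num = lik_contact * p_contact
    den = num + lik_no_contact * (1.0 - p_contact)
    return num / den if den > 0.0 else p_contact


def run_filter(prior, steps):
    """Run the filter over (p_make, p_break, lik_contact, lik_no_contact) tuples."""
    p = prior
    history = []
    for p_make, p_break, lik_c, lik_nc in steps:
        p = predict(p, p_make, p_break)
        p = update(p, lik_c, lik_nc)
        history.append(p)
    return history


if __name__ == "__main__":
    # Three consecutive measurements that all favor "contact".
    print(run_filter(0.5, [(0.1, 0.1, 0.9, 0.1)] * 3))
```

Thresholding the resulting probability (for instance at 0.5) would give a binary contact estimate; the threshold is likewise a choice made for this sketch.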

8.3.1 End-to-End and Highly-Efficient Differentiable Simulation‌ for Robotics

Participants: Quentin Le Lidec, Louis‌ Montaut, Yann de Mont-Marin, Justin Carpentier‌.

Over the past few years, robotics simulators have greatly improved in efficiency and scalability, enabling them to generate years of simulated data in a few hours. Yet, efficiently and accurately computing the simulation derivatives remains an open challenge, with potentially high gains on the convergence speed of reinforcement learning and trajectory optimization algorithms, especially for problems involving physical contact interactions. This paper 49 contributes to this objective by introducing a unified and efficient algorithmic solution for computing the analytical derivatives of robotic simulators. The approach considers both the collision and frictional stages, accounting for their intrinsic nonsmoothness and also exploiting the sparsity induced by the underlying multibody systems. These derivatives have been implemented in C++, and the code will be open-sourced in the Simple simulator. They exhibit state-of-the-art timings ranging from 5 µs for a 7-dof manipulator up to 95 µs for a 36-dof humanoid, outperforming alternative solutions by a factor of at least 100.

Figure 16: Illustration of the sliding mode. $\lambda^{*}$ lies on the boundary of the cone $K_{\mu}$, in the direction opposite to $\sigma = \sigma_{T}$, and the variation $d\lambda^{*}$ lies inside the tangent plane.

8.3.2 Differentiable Simulation of Soft‌ Robots with Frictional Contacts‌

Participants: Etienne Ménager,‌‌ Louis Montaut, Quentin Le Lidec, Justin‌ Carpentier.

In recent‌ years, soft robotics simulators‌‌ have evolved to offer various functionalities, including the‌ simulation of different material‌ types (e.g., elastic, hyper-elastic)‌‌ and actuation methods (e.g., pneumatic, cable-driven, servomotor). These‌ simulators also provide tools‌ for various tasks, such‌‌ as calibration, design, and control. However, efficiently and‌ accurately computing derivatives within‌ these simulators remains a‌‌ challenge, particularly in the presence of physical contact‌ interactions. Incorporating these derivatives‌ can, for instance, significantly‌‌ improve the convergence speed of control methods like‌ reinforcement learning and trajectory‌ optimization, enable gradient-based techniques‌‌ for design, or facilitate end-to-end machine-learning approaches for‌ model reduction. This paper‌ 29 addresses these challenges‌‌ by introducing a unified method for computing the‌ derivatives of mechanical equations‌ within the finite element‌‌ method framework, including contact interactions modeled as a‌ nonlinear complementarity problem. The‌ proposed approach handles both‌‌ collision and friction phases, accounts for their nonsmooth‌ dynamics, and leverages the‌ sparsity introduced by mesh-based‌‌ models. Its effectiveness is demonstrated through several examples‌ of controlling and calibrating‌ soft systems.

8.3.3 Constrained‌‌ Articulated Body Algorithms for Closed-Loop Mechanisms

Participants: Ajay‌ Suresha Sathya, Justin‌ Carpentier.

Efficient rigid-body dynamics algorithms are instrumental in enabling high-frequency dynamics evaluation for resource-intensive applications (e.g., model predictive control, large-scale simulation, reinforcement learning), potentially on resource-constrained hardware. Existing recursive algorithms with low computational complexity are mostly restricted to kinematic trees with external contact constraints or are sensitive to singular cases (e.g., linearly dependent constraints and kinematic singularities), severely impacting their practical usage in existing simulators. This article 54 introduces two original low-complexity recursive algorithms, the loop-constrained articulated body algorithm (LCABA) and proxBBO, based on a proximal dynamics formulation for forward simulation of mechanisms with loops. These algorithms are derived from first principles using non-serial dynamic programming, exhibit linear complexity in practical scenarios, and are numerically robust to singular cases. They extend the existing constrained articulated body algorithm (constrainedABA) to handle internal loops and the pioneering BBO algorithm from the 1980s to singular cases. Both algorithms have been implemented by leveraging the open-source Pinocchio library, benchmarked in detail, and exhibit state-of-the-art performance for various robot topologies, including over 6x speed-ups compared to existing non-recursive algorithms for high degree-of-freedom systems with internal loops such as recent humanoid robots.

8.3.5 Guardian: Detecting Robotic Planning‌ and Execution Errors with Vision-Language Models

Participants: Paul‌ Pacaud, Ricardo Garcia, Shizhe Chen,‌ Cordelia Schmid.

This paper 52 addresses the problem of reliable failure detection and recovery in robotic manipulation. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection, as illustrated in Fig. 18. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and on real robots, demonstrating the impact of our generated failure data.

Figure 18: Illustration of our Guardian model, a VLM fine-tuned on our constructed failure datasets. It detects planning failures (top) and execution failures (bottom) in robotic manipulation.

8.3.6 Augmented Lagrangian methods for‌ infeasible convex optimization problems‌ and diverging proximal-point algorithms‌‌

Participants: Roland Andrews, Justin Carpentier, Adrien‌ Taylor.

This work‌ investigates the convergence behavior‌‌ of augmented Lagrangian methods (ALMs) when applied to‌ convex optimization problems that‌ may be infeasible. ALMs‌‌ are a popular class of algorithms for solving‌ constrained optimization problems. We‌ establish progressively stronger convergence‌‌ results, ranging from basic sequence convergence to precise‌ convergence rates, under a‌ hierarchy of assumptions. In‌‌ particular, we demonstrate that, under mild assumptions, the‌ sequences of iterates generated‌ by ALMs converge to‌‌ solutions of the “closest feasible problem”.

This study‌ leverages the classical relationship‌ between ALMs and the‌‌ proximal-point algorithm applied to the dual problem. A‌ key technical contribution is‌ a set of concise‌‌ results on the behavior of the proximal-point algorithm‌ when applied to functions‌ that may not have‌‌ minimizers. These results pertain to its convergence in‌ terms of its subgradients‌ and of the values‌‌ of the convex conjugate.
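The behavior described above can be observed on a deliberately tiny infeasible problem. The sketch below is an illustration under stated assumptions, not the analysis of the paper: it minimizes 0.5·(x1² + x2²) subject to the two contradictory constraints x1 + x2 = 0 and x1 + x2 = 2, and runs the standard augmented Lagrangian iteration with a fixed penalty `rho`. The problem data, the function name, and the closed-form primal step (valid for this specific problem only) are choices made for the example.

```python
def alm_infeasible(rho=1.0, iters=50):
    """Augmented Lagrangian iterations on
        minimize 0.5 * (x1^2 + x2^2)
        subject to x1 + x2 = 0 and x1 + x2 = 2   (infeasible on purpose).
    By symmetry the primal minimizer has x1 = x2 = t, with t in closed form."""
    b = (0.0, 2.0)
    y = [0.0, 0.0]  # multipliers, one per constraint
    t = 0.0
    r = (0.0, 0.0)
    for _ in range(iters):
        # Primal step: stationarity of the augmented Lagrangian in x gives
        # t * (1 + 4 * rho) = rho * (b1 + b2) - (y1 + y2).
        t = (rho * (b[0] + b[1]) - (y[0] + y[1])) / (1.0 + 4.0 * rho)
        # Constraint residuals A x - b.
        r = (2.0 * t - b[0], 2.0 * t - b[1])
        # Multiplier (dual proximal-point) step.
        y[0] += rho * r[0]
        y[1] += rho * r[1]
    return (t, t), r, tuple(y)


if __name__ == "__main__":
    x, residual, y = alm_infeasible()
    print("x =", x)
    print("residual =", residual)
    print("y =", y)
```

In this run the primal iterates approach (0.5, 0.5), which solves the problem with both right-hand sides shifted to 1 (the smallest shift that restores feasibility), while the multipliers keep growing along the direction of the inconsistency.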

8.3.7 Certifiably optimal rotation‌ and pose estimation based‌ on the Cayley map‌‌

Participants: Timothy D Barfoot, Connor Holmes,‌ Frederike Dümbgen.

We‌ present novel, convex relaxations‌‌ for rotation and pose estimation problems that can‌ a posteriori guarantee global‌ optimality for practical measurement‌‌ noise levels. Some such relaxations exist in the‌ literature for specific problem‌ setups that assume the‌‌ matrix von Mises-Fisher distribution (a.k.a., matrix Langevin distribution‌ or chordal distance) for‌ isotropic rotational uncertainty. However,‌‌ another common way to represent uncertainty for rotations‌ and poses is to‌ define anisotropic noise in‌‌ the associated Lie algebra. Starting from a noise‌ model based on the‌ Cayley map, we define‌‌ our estimation problems, convert them to Quadratically Constrained‌ Quadratic Programs (QCQPs), then‌ relax them to Semidefinite‌‌ Programs (SDPs), which can be solved using standard‌ interior-point optimization methods; global‌ optimality follows from Lagrangian‌‌ strong duality. We first show how to carry‌ out basic rotation and‌ pose averaging. We then‌‌ turn to the more complex problem of trajectory‌ estimation, which involves many‌ pose variables with both‌‌ individual and inter-pose measurements (or motion priors). Our‌ contribution 12 is to‌ formulate SDP relaxations for‌‌ all these problems based on the Cayley map‌ (including the identification of‌ redundant constraints) and to‌‌ show them working in practical settings. We hope‌ our results can add‌ to the catalogue of‌‌ useful estimation problems whose solutions can be a‌ posteriori guaranteed to be‌ globally optimal.
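For readers unfamiliar with the QCQP-to-SDP step mentioned above, the standard lifting can be written as follows. The symbols $Q$, $A_i$, $b_i$ and $x$ are generic placeholders introduced for this illustration; the specific cost and constraint matrices arising from the Cayley-map noise model are given in the paper and are not reproduced here.

$$
\begin{aligned}
\text{(QCQP)}\quad & \min_{x \in \mathbb{R}^{n}} \; x^{\top} Q\, x
\quad \text{s.t.} \quad x^{\top} A_i\, x = b_i, \quad i = 1, \dots, m, \\
\text{(lifted)}\quad & \min_{X \in \mathbb{S}^{n}} \; \operatorname{tr}(Q X)
\quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad X \succeq 0, \quad \operatorname{rank}(X) = 1, \\
\text{(SDP)}\quad & \min_{X \in \mathbb{S}^{n}} \; \operatorname{tr}(Q X)
\quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad X \succeq 0 .
\end{aligned}
$$

The lifted problem is equivalent to the QCQP through $X = x x^{\top}$, and the SDP is obtained by dropping the rank constraint, so its optimal value is a lower bound on that of the QCQP. If the SDP solution $X^{\star}$ happens to have rank one, then $X^{\star} = x^{\star} x^{\star\top}$ with $x^{\star}$ feasible for the QCQP and attaining the lower bound, which certifies global optimality a posteriori.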

8.3.8 Human-Robot‌‌ Co-Simulation method for upper limb assistive force calculation‌ using polytopes

Participants: Maël‌ Gallois, Mégane Millan‌‌, Nicolas Vignais, Sylvain Guégan, Marie‌ Babel, Justin Carpentier‌, Charles Pontonnier.‌‌

Assisting the upper limb constitutes a significant challenge in the rehabilitation and readaptation of individuals with neuromuscular and/or neurodegenerative disorders. To address this issue, robotic devices such as exoskeletons have been designed. However, the control of these devices remains intricate and challenging, particularly in the context of the upper limb. The objective of this study 13 is to define a method to compute the assistance an exoskeleton should provide to the user, according to its capabilities. To minimize the user's voluntary effort, we compute optimal assisting forces that minimize the torque the user must exert along a movement while maximizing the forces provided at the user interfaces. Polytopes (Skiric 2023) are used to determine feasible sets.

8.3.9 PROXQP: an Efficient and‌ Versatile Quadratic Programming Solver for Real-Time Robotics Applications‌ and Beyond

Participants: Antoine Bambade, Fabian Schramm‌, Sarah El Kazdadi, Stéphane Caron,‌ Adrien Taylor, Justin Carpentier.

Convex quadratic programming (QP) has become a core component in the modern engineering toolkit, particularly in robotics, where QP problems are legion, ranging from real-time whole-body controllers to planning and estimation algorithms. Many of those QPs need to be solved at high frequency. Meeting timing requirements requires taking advantage of as many structural properties as possible for the problem at hand. For instance, it is generally crucial to resort to warm-starting to exploit the resemblance of consecutive control iterations. While a large range of off-the-shelf QP solvers is available, only a few are suited to exploit problem structure and warm-starting capacities adequately. In this work 11, we propose the PROXQP algorithm, a new and efficient QP solver that exploits QP structures by leveraging primal-dual augmented Lagrangian techniques. For convex QPs, PROXQP features a global convergence guarantee to the closest feasible QP, an essential property for safe closed-loop control. We illustrate its practical performance on various standard robotic and control experiments, including a real-world closed-loop model predictive control application. While originally tailored for robotics applications, we show that PROXQP also performs at the level of the state of the art on generic QP problems, making PROXQP suitable for use as an off-the-shelf solver for regular applications beyond robotics.

8.3.10 Optimal Control‌ of Walkers with Parallel Actuation

Participants: Ludovic de‌ Matteis, Virgile Batto, Justin Carpentier,‌ Nicolas Mansard.

Legged robots with complex kinematic architectures, such as parallel linkages, offer significant advancements in mobility and efficiency. However, generating versatile movements for these robots requires accurate dynamic modeling that reflects their specific mechanical structures. Previous approaches often relied on simplified models, resulting in sub-optimal control, particularly in tasks requiring the full actuator range. Here 28, we present a method that fully models the dynamics of legged robots with parallel linkages, formulating their motion generation as an optimal control problem with specific contact dynamics. We introduce 6D kinematic closure constraints and derive their analytical derivatives, enabling the solver to exploit nonlinear transmission and the consequent variable actuator reduction. This approach reduces peak motor torques and expands the usable range of actuator motion and force. We empirically demonstrate that fully modeling the kinematics leads to superior performance, especially in demanding tasks such as fast walking and stair climbing. Beyond serial-parallel designs, our method also addresses motion generation for fully-parallel walkers.

8.3.11 Extended‌‌ URDF: Accounting for parallel mechanism in robot description‌

Participants: Virgile Batto,‌ Ludovic de Matteis,‌‌ Nicolas Mansard.

Robotic designs have played an important role in recent advances by providing powerful robots with complex mechanics. Many recent systems rely on parallel actuation to provide lighter limbs and allow more complex motion. However, these emerging architectures fall outside the scope of the most widely used description formats, leading to difficulties when designing, storing, and sharing the models of these systems. This paper 18 introduces an extension to the widely used Unified Robot Description Format (URDF) to support closed-loop kinematic structures. Our approach relies on augmenting URDF with minimal additional information to allow more efficient modeling of complex robotic systems while maintaining compatibility with existing design and simulation frameworks. This method sets the basic requirement for a description format to handle parallel mechanisms efficiently. We demonstrate the applicability of our approach by providing an open-source collection of parallel robots, along with tools for generating and parsing this extended description format. The proposed extension simplifies robot modeling, reduces redundancy, and improves usability for advanced robotic applications.

8.3.12 PROXDDP: Proximal‌ Constrained Trajectory Optimization

Participants:‌ Wilson Jallet, Antoine‌‌ Bambade, Etienne Arlaud, Sarah El Kazdadi‌, Nicolas Mansard,‌ Justin Carpentier.

Trajectory optimization has been a popular choice for motion generation and control in robotics for at least a decade. Several numerical approaches have exhibited the required speed to enable online computation of trajectories for real-time control of various systems, including complex robots. Many of these are based on the differential dynamic programming (DDP) algorithm (initially designed for unconstrained trajectory optimization problems) and its variants, which are relatively easy to implement and provide good runtime performance. However, several problems in robot control call for using constrained formulations (e.g., torque limits, obstacle avoidance), from which several difficulties arise when trying to adapt DDP-type methods: numerical stability, computational efficiency, and constraint satisfaction. In this article 14, we leverage proximal methods for constrained optimization and introduce a DDP-type method for fast, constrained trajectory optimization suited for model-predictive control (MPC) applications with easy warm-starting. Compared to earlier solvers, our approach effectively manages hard constraints without warm-start limitations and exhibits good convergence behavior. We provide a complete implementation as part of an open-source and flexible C++ trajectory optimization library called ALIGATOR. These algorithmic contributions are validated through several trajectory planning scenarios from the robotics literature and the real-time whole-body MPC of a quadruped robot.

8.3.13 Structure-Exploiting‌ Sequential Quadratic Programming for‌ Model-Predictive Control

Participants: Armand‌‌ Jordana, Sébastien Kleff, Avadesh Meduri,‌ Justin Carpentier, Nicolas‌ Mansard, Ludovic Righetti‌‌.

The promise of model-predictive control in robotics has led to extensive development of efficient numerical optimal control solvers in line with differential dynamic programming because it exploits the sparsity induced by time. In this work 15, we argue that this effervescence has hidden the fact that sparsity can be equally exploited by standard nonlinear optimization. In particular, we show how a tailored implementation of sequential quadratic programming achieves state-of-the-art model-predictive control. Then, we clarify the connections between popular algorithms from the robotics community and well-established optimization techniques. Further, the sequential quadratic program formulation naturally encompasses the constrained case, a notoriously difficult problem in the robotics community. Specifically, we show that it only requires a sparsity-exploiting implementation of a state-of-the-art quadratic programming solver. We illustrate the validity of this approach in a comparative study and experiments on a torque-controlled manipulator. To the best of our knowledge, this is the first demonstration of closed-loop nonlinear model-predictive control with constraints on a real robot.

8.3.14 Modeling, Embedded Control and Design of Soft‌ Robots using a Learned Condensed FEM Model

Participants:‌ Tanguy Navez, Etienne Ménager, Paul Chaillou‌, Olivier Goury, Alexandre Kruszewski, Christian‌ Duriez.

The Finite Element Method (FEM) is a powerful modeling tool for predicting soft robots' behavior, but its computation time can limit practical applications. In this paper 16, a learning-based approach based on condensation of the FEM model is detailed. The proposed method handles several kinds of actuators and contacts with the environment. We demonstrate that this compact model can be learned as a unified model across several designs and remains very efficient in terms of modeling since we can deduce the direct and inverse kinematics of the robot. Building upon the intuition introduced in [11], the learned model is presented as a general framework for modeling, controlling, and designing soft manipulators. First, the method's adaptability and versatility are illustrated through optimization-based control problems involving positioning and manipulation tasks with mechanical contact-based coupling. Second, the low memory consumption and the high prediction speed of the learned condensed model are leveraged for real-time embedded control without relying on costly online FEM simulation. Finally, the ability of the learned condensed FEM model to capture soft robot design variations and its differentiability are leveraged in calibration and design optimization applications.

8.3.15 Infinite-Horizon Value Function‌ Approximation for Model Predictive Control

Participants: Armand Jordana‌, Sébastien Kleff, Arthur Haffemayer, Joaquim‌ Ortiz-Haro, Justin Carpentier, Nicolas Mansard,‌ Ludovic Righetti.

Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work 17, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator where the value function is conditioned on the goal and obstacle.
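The idea of using an infinite-horizon value function as a terminal cost can be illustrated on a very small discrete problem. The sketch below swaps the neural network and trajectory optimizer of the paper for a plain lookup table and exhaustive one-step search, on a hypothetical 1D chain where each move costs 1 until the goal is reached. All names, the cost, and the dynamics are assumptions made for this example.

```python
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def step(state, action, n_states):
    """Deterministic chain dynamics, clipped at both ends (toy assumption)."""
    return min(max(state + action, 0), n_states - 1)


def value_iteration(n_states, goal, sweeps=100):
    """Tabular value iteration for the undiscounted cost-to-go:
    unit cost per step, zero cost once at the goal."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        new_V = []
        for s in range(n_states):
            if s == goal:
                new_V.append(0.0)
            else:
                new_V.append(min(1.0 + V[step(s, a, n_states)] for a in ACTIONS))
        V = new_V
    return V


def mpc_action(state, V):
    """One-step lookahead controller that uses V as its terminal cost."""
    n_states = len(V)
    return min(ACTIONS, key=lambda a: 1.0 + V[step(state, a, n_states)])


if __name__ == "__main__":
    V = value_iteration(n_states=6, goal=4)
    print(V)
    print(mpc_action(1, V))
```

Even with a horizon of a single step, the controller heads toward the goal from any state because the terminal cost summarizes everything beyond the horizon. With a zero terminal cost, the same one-step controller would have no reason to move.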

8.4‌‌ Image restoration and enhancement

8.4.1 A New Statistical‌ Model of Star Speckles‌ for Learning to Detect‌‌ and Characterize Exoplanets in Direct Imaging Observations

Participants:‌ Théo Bodrito, Olivier‌ Flasseur, Julien Mairal‌‌, Jean Ponce, Maud Langlois, Anne-Marie‌ Lagrange.

The search‌ for exoplanets is an‌‌ active field in astronomy, with direct imaging as‌ one of the most‌ challenging methods due to‌‌ faint exoplanet signals buried within stronger residual starlight.‌ Successful detection requires advanced‌ image processing to separate‌‌ the exoplanet signal from this nuisance component. This‌ paper 19 presents a‌ novel statistical model that‌‌ captures nuisance fluctuations using a multi-scale approach, leveraging‌ problem symmetries and a‌ joint spectral channel representation‌‌ grounded in physical principles. Our model integrates into‌ an interpretable, end-to-end learnable‌ framework for simultaneous exoplanet‌‌ detection and flux estimation. The proposed algorithm is‌ evaluated against the state‌ of the art using‌‌ datasets from the SPHERE instrument operating at the‌ Very Large Telescope (VLT).‌ It significantly improves the‌‌ precision-recall trade-off, notably on challenging datasets that are‌ otherwise unusable by astronomers.‌ The proposed approach is‌‌ computationally efficient, robust to varying data quality, and‌ well suited for large-scale‌ observational surveys.

8.4.2 Deep‌‌ learning for exoplanet detection and characterization by direct‌ imaging at high contrast‌

Participants: Théo Bodrito,‌‌ Olivier Flasseur, Julien Mairal, Jean Ponce‌, Maud Langlois,‌ Anne-Marie Lagrange.

Exoplanet‌‌ imaging is a major challenge in astrophysics due‌ to the need for‌ high angular resolution and‌‌ high contrast. We present a multi-scale statistical model‌ 20 for the nuisance‌ component corrupting multivariate image‌‌ series at high contrast. Integrated into a learnable‌ architecture, it leverages the‌ physics of the problem‌‌ and enables the fusion of multiple observations of‌ the same star in‌ a way that is‌‌ optimal in terms of detection signal-to-noise ratio. Applied‌ to data from the‌ VLT/SPHERE instrument, the method‌‌ significantly improves the detection sensitivity and the accuracy‌ of astrometric and photometric‌ estimation.

8.4.3 Joint statistical‌‌ modeling and deep learning for exoplanet detection and‌ characterization by direct imaging‌ at high contrast

Participants:‌‌ Théo Bodrito, Olivier Flasseur, Julien Mairal‌, Jean Ponce,‌ Maud Langlois, Anne-Marie‌‌ Lagrange.

The detection of exoplanets, the characterization‌ of their atmospheres, and‌ the study of exoplanet‌‌ formation mechanisms are major current challenges in astrophysics.‌ High-contrast direct imaging (HCI)‌ is one of the‌‌ observational techniques of choice to address these questions.‌ However, such observations are‌ particularly demanding due to‌‌ the extreme contrast levels and angular resolution required.‌ In addition to the‌ use of extreme adaptive‌‌ optics and coronagraphs, advances‌ in data science have become critical for analyzing‌ these observations and disentangling the signals of interest‌ (exoplanets and circumstellar disks) from the strong nuisance‌ component (speckles and noise) that corrupts the data.‌ In this context, we will present our recent‌ developments in deep learning applied to HCI 21‌, aimed at the optimal and reliable extraction‌ of astrophysical information from multivariate observations (including spatial,‌ temporal, spectral, and multi-epoch diversity). These approaches are‌ based on a fine modeling of the different‌ components contributing to the total signal and incorporate‌ physical domain knowledge as prior information. Emphasis will‌ be placed on (i) combining deep learning models‌ with statistical modeling of the nuisance, (ii) leveraging‌ large archival datasets as a valuable source of‌ diversity for tackling the unmixing task, and (iii)‌ jointly exploiting the spectral diversity of observations. Our‌ methods are tailored to the specific challenges of‌ high-contrast imaging: (i) very low signal-to-noise ratios and‌ non-stationary noise, (ii) detection of rare events, and‌ (iii) absence of ground truth. 
Using data from‌ the VLT/SPHERE instrument, we will show that these‌ approaches enable fine modeling and effective subtraction of‌ the nuisance component, leading to reliable and nearly‌ optimal estimates of the astrophysical quantities of interest.‌ This results in significantly improved detection sensitivity and‌ more accurate astro-photometric characterization. The proposed approaches are‌ also scalable and readily applicable to large-scale surveys.‌ Looking ahead, instruments on the next generation of‌ thirty-meter-class telescopes will enable the exploration of the‌ innermost environments of Sun-like stars at unprecedented contrast‌ levels. Achieving the associated scientific goals will require‌ addressing several data science challenges: (i) approaching the‌ ultimate performance limits of the instruments through optimal‌ signal extraction, (ii) capturing complex, spatially structured nuisance‌ exhibiting strong variability, and (iii) building robust nuisance‌ models that go beyond the limitations of angular‌ differential imaging, particularly in the vicinity of the‌ host star. We will discuss these challenges in‌ light of the methodological developments presented.

8.4.4 Learnable statistical mixture model and multivariate data fusion for exoplanet imaging

Participants: Théo‌ Bodrito, Olivier Flasseur, Julien Mairal,‌ Jean Ponce, Maud Langlois, Anne-Marie Lagrange‌.

Exoplanet imaging is a major challenge in‌ astrophysics due to the high star-planet contrast. This‌ paper 22 presents a multi-scale statistical model for‌ the nuisance component corrupting multivariate image series. Integrated‌ into a learnable architecture, it leverages the physics‌ of the problem and enables the fusion of‌ multiple observations of the same star. Applied to‌ real data, the method significantly improves the detection‌ sensitivity and the accuracy of exoplanet position and‌ flux estimation.

8.4.5 CoDEx: Combining Domain Expertise for‌ Spatial Generalization in Satellite Image Analysis

Participants: Abhishek‌ Kuriyal, Elliot Vincent, Mathieu Aubry,‌ Loic Landrieu.

Global variations in terrain appearance‌ raise a major challenge for satellite image analysis,‌ leading to poor model performance when training on‌ locations that differ from those encountered at test time. This remains true‌ even with recent large‌ global datasets. To address‌‌ this challenge, we propose a novel domain-generalization framework‌ for satellite images 26‌. Instead of trying‌‌ to learn a single generalizable model, we train‌ one expert model per‌ training domain, while learning‌‌ experts' similarity and encouraging similar experts to be‌ consistent. A model selection‌ module then identifies the‌‌ most suitable experts for a given test sample‌ and aggregates their predictions.‌ Experiments on four datasets‌‌ (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains‌ over existing domain generalization‌ and adaptation methods.

8.5‌‌ Doctoral dissertations and habilitation theses

8.5.1 Deep learning‌ for exoplanet detection in‌ high contrast imaging

Participants:‌‌ Théo Bodrito.

The thesis 37 addresses the challenge of detecting and characterizing exoplanets through direct imaging, a technique hindered by the extreme contrast and small angular separation between stars and planets. To overcome these issues, this work introduces three hybrid approaches that combine statistical modeling with deep learning, leveraging large datasets from high-contrast imaging surveys. The deep PACO method integrates a local statistical model of the nuisance component with a convolutional neural network, improving detection and characterization performance over classical algorithms. MODEL&CO further advances this by learning a single model across a large multi-observation dataset, enabling robust detection even in challenging conditions. ExoMILD introduces a multi-scale, multi-spectral statistical framework that exploits spatial symmetries, achieving superior sensitivity and unbiased parameter estimation. Extensive testing on semi-synthetic and real datasets demonstrates significant gains in contrast and robustness, particularly at small separations. The approaches are designed to generalize across diverse observing conditions and are well-suited for future large-scale surveys. Overall, the thesis establishes a new generation of deep learning-based tools for exoplanet imaging, enabling more sensitive and reliable exploration of planetary systems.

8.5.2 Learning dexterous manipulation from‌ 3D hand and object‌ interaction

Participants: Zerui Chen‌‌.

In this thesis 39, we advance the understanding of 3D hand motions and hand-object interactions in monocular videos. We show how these insights can empower robots with human-like dexterous manipulation capabilities. Our approach achieves dense 3D reconstructions of both hands and objects, capturing their fine-grained interactions while maintaining fast inference speed. We investigate how to leverage these 3D reconstruction results to transfer human manipulation skills to multi-fingered robotic hands through trajectory-guided reinforcement learning and vision-based imitation learning. By effectively connecting visual motion capture with robotic execution, our work creates new opportunities for human-robot collaboration. Our contributions are structured into four key areas. First, we propose a joint learning framework for 3D reconstruction of hands and objects using signed distance functions (SDFs). This method generates high-resolution meshes and captures detailed hand-object interactions. Second, to improve the alignment between the reconstructed 3D shape and its underlying poses, we leverage hand kinematic structures to guide SDF-based reconstruction, which helps enhance visual features and increase robustness to occlusions. Third, while SDF-based methods yield promising results, they are computationally intensive and often produce overly smooth surfaces. To address this, we introduce a novel transformer-based approach for reconstructing dense point clouds of hand-held objects, achieving high-quality 3D reconstructions with fast inference speed. Finally, although vision systems produce visually plausible 3D hand and object configurations, these configurations may not always be physically plausible, which makes them less useful for robot learning. To tackle this, we develop ViViDex, which first employs reinforcement learning to refine these noisy configurations and then applies imitation learning to train a unified vision-based policy from the refined trajectories. As a result, ViViDex generates natural manipulation sequences and demonstrates superior performance across three dexterous manipulation tasks.
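To make the signed distance function (SDF) representation mentioned in this summary more concrete, the following is a minimal sketch and not the thesis code. It assumes analytic spheres as stand-ins for the learned hand and object SDFs; the names `sphere_sdf`, `hand_sdf`, `object_sdf` and `near_both_surfaces` are hypothetical. An SDF is negative inside a shape, zero on its surface and positive outside, so a point where both values are close to zero is a candidate hand-object contact.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius


# Hypothetical stand-ins for learned SDFs (assumption: real models are networks).
def hand_sdf(point):
    return sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0)


def object_sdf(point):
    return sphere_sdf(point, center=(2.0, 0.0, 0.0), radius=1.0)


def near_both_surfaces(point, sdf_a, sdf_b, tol=1e-6):
    """True when the point lies (within tol) on both zero level sets."""
    return abs(sdf_a(point)) <= tol and abs(sdf_b(point)) <= tol


# The two unit spheres touch at (1, 0, 0).
contact = near_both_surfaces((1.0, 0.0, 0.0), hand_sdf, object_sdf)
```

In a learned setting, the same queries would be answered by a network evaluated at 3D points, and a mesh would be extracted from the zero level set.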

8.5.3 Analysis of satellite image time series for‌ classification and change detection

Participants: Elliot Vincent.‌

This thesis 41 develops machine learning methods for analyzing satellite image time series (SITS), focusing on soil classification and semantic change detection. We propose three main areas of improvement. First, we design architectures specifically tailored for SITS: DTI-TS for agricultural classification, multiUTAE for change detection, and a combination of a foundation model with a temporal attention model for detecting archaeological looting. Second, we address the lack of annotated data by developing weakly supervised methods and introducing a dataset for archaeological looting detection. Finally, we examine the impact of spatial and temporal domain shifts on model performance. The DTI-TS method aligns time series prototypes with data using spectral and temporal transformations. It excels in contexts with temporal shifts and data scarcity while maintaining good interpretability. MultiUTAE segments all images in a series simultaneously, leveraging information over a broad temporal window. This sequence-to-sequence approach outperforms methods that process images individually or in pairs across various domain shift scenarios. For archaeological looting detection, the thesis introduces DAFA-LS, a dataset of Afghan sites. The best performance is achieved by a method combining a pre-trained foundation model and an attention model. Future research directions include leveraging foundation models and multi-modality, enhancing time series by improving resolution or adding elevation data, and developing unsupervised learning and domain adaptation to mitigate the lack of annotated data.

8.5.4 Object-centric representations for‌ sensing and planning in visually-guided robotics

Participants: Thomas‌ Chabal.

This thesis 38 pursues the ultimate goal of developing autonomous and intelligent robotic assistants with the abilities to perceive and understand the world, and to explore and act on it. Such systems would have to navigate to and interact with a variety of individual components, or objects. Working from images, this thesis specifically aims at developing novel object-centric representations and algorithms to perceive and reconstruct scenes, before planning robotic manipulation and navigation actions. It addresses three main challenges: handling failures of visual systems and partial knowledge of goals in robotic assembly tasks, efficiently acquiring accurate and complete object-level representations of scenes in an online fashion, and learning to understand the semantics of human-arranged environments to explore and search for objects with a mobile robot. We advance the field through the following three contributions. First, we study the problem of stacking objects with a robotic manipulator to reproduce an assembly specified through a single photograph. As visual systems encounter unavoidable failures in analyzing images, notably due to occlusions, the target structure is only partially known. We present an approach intertwining an abstract search for a high-level assembly plan and a physical grounding of candidate plans in the real world. Our method, deployed on a robotic manipulator, builds stable structures that match the goal assemblies, known by extracting object poses with an off-the-shelf procedure in the goal image. Second, beyond fixed robots with limited access to observations of their surroundings, we consider cameras that freely move in the physical world and explore online scene reconstruction at the level of objects from a stream of posed RGB-D frames. We model objects as neural implicit representations, entailing feature grids and small perceptrons, optimized per scene with differentiable rendering.
We propose a feature grid interpolation scheme‌ to adapt to novel‌ views of yet unseen‌‌ object parts, as well as a relocalization approach‌ to reuse object models‌ in novel scenes and‌‌ an update procedure synthesizing views from past viewpoints‌ to increase the completion‌ of reconstructed objects in‌‌ novel sequences. Finally, we focus on robot navigation‌ towards objects specified as‌ categories in unknown environments,‌‌ a task requiring accurate scene understanding and efficient‌ exploration. We introduce an‌ online frontier-object mapping with‌‌ rich visual and semantic representations of frontiers, or‌ boundaries of the explored‌ area, and object instances.‌‌ Our navigation strategy combines a high-level goal prediction‌ stage relying on a‌ vision-language model endowed with‌‌ learnt navigation-specific encoders and decoders and a low-level‌ path planner that generates‌ trajectories. Our modular framework,‌‌ dubbed FOM-Nav for Frontier-Object Maps, is trained on‌ an automatically self-collected large-scale‌ navigation dataset in scanned‌‌ environments and significantly improves exploration efficiency over prior‌ works.
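The notion of frontiers, or boundaries of the explored area, can be illustrated with a small sketch. This is not FOM-Nav itself: it assumes a 2D occupancy grid with hypothetical cell labels (`FREE`, `OCCUPIED`, `UNKNOWN`) and uses the common definition of a frontier cell as a free cell with at least one unknown 4-neighbor.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # hypothetical cell labels


def find_frontiers(grid):
    """Return (row, col) of free cells adjacent to at least one unknown cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            # Check the 4-connected neighborhood, staying inside the grid.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
frontier_cells = find_frontiers(grid)  # [(0, 1), (2, 2)]
```

A navigation system would then attach visual and semantic features to such cells and choose which one to visit next; that selection step is left out of this sketch.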

8.5.5 Learning Visuomotor‌ Policies for Robotic Manipulation‌‌

Participants: Ricardo Garcia.

This thesis 40 focuses on the development of representations and learning algorithms to perform visually-guided robotic manipulation tasks in unstructured environments. One of the final goals of robotics is to train robots that can autonomously solve a wide range of tasks in the real world based on human instructions. To make progress towards this goal, this manuscript covers three main challenges: closing the sim-to-real gap for visuomotor policies, integrating 3D point cloud representations with language instructions to improve the performance of robotic manipulation in multi-task settings, and developing a generalist language-guided visuomotor policy for robotic manipulation. We first address the challenge of sim-to-real transfer in robotic manipulation. Training policies in simulation is less time-consuming and safer than in the real world. However, the discrepancies between simulation and the real environment can limit the transferability of policies trained in simulation to the real world. This issue is known as the sim-to-real gap, and domain randomization (DR) is a known technique to address it. DR enables sim-to-real policy transfer by randomizing the simulation's appearance (textures, lighting, object colors, and camera viewpoints) during training. However, selecting the right ranges for the randomization parameters is not trivial. This thesis proposes a data-driven strategy to systematically select the DR parameters using multi-object localization as a proxy task. We then focus on language-guided robotic manipulation and propose PolarNet and 3D-LOTUS, two 3D point cloud-based methods to integrate visual inputs and language instructions to predict manipulation actions.
Both methods use efficient point cloud encoders and multimodal transformers to combine the text instructions and the geometric information from point clouds, enabling more accurate and efficient manipulation than 2D image-based approaches. Both 3D-based policies outperform state-of-the-art models across various multi-task settings of the RLBench benchmark and successfully transfer to a real-world robot, highlighting their performance in diverse environments. The last part of this thesis focuses on developing generalist policies for robotic manipulation. First, we propose GemBench, a comprehensive benchmark for evaluating the generalization capabilities of such policies on a set of tasks with four levels of increasing difficulty: (1) novel object placements, (2) novel rigid objects, (3) novel articulated objects, and (4) long-horizon tasks that require sequential planning. We then propose 3D-LOTUS++, which extends our point cloud-based policy 3D-LOTUS by incorporating large language models (LLMs) for task planning and vision-language models (VLMs) for object grounding. This modular framework achieves state-of-the-art performance on this new benchmark. Through these contributions, this thesis advances the development of robust, precise, and generalist visuomotor policies for robotic manipulation.
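The domain randomization (DR) step described in this summary can be sketched as follows. This is an illustrative assumption, not the thesis implementation: the parameter names and ranges in `DR_RANGES` are made up, and `proxy_error` is a hypothetical callable standing in for the error measured on a proxy task such as multi-object localization.

```python
import random

# Hypothetical appearance parameters and ranges (assumptions for illustration).
DR_RANGES = {
    "light_intensity": (0.5, 1.5),
    "camera_yaw_deg": (-10.0, 10.0),
    "object_hue": (0.0, 1.0),
}


def sample_dr_params(ranges, rng):
    """Draw one randomized simulation appearance, uniformly within each range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def select_ranges(candidates, proxy_error):
    """Pick the candidate range set with the lowest error on a proxy task."""
    return min(candidates, key=proxy_error)


rng = random.Random(0)  # seeded for reproducibility
params = sample_dr_params(DR_RANGES, rng)
```

During training, a fresh `params` draw would be applied to the simulator before each episode; comparing several candidate range sets with `select_ranges` mirrors the idea of choosing DR parameters from data rather than by hand.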

9 Bilateral contracts and grants with‌ industry

9.1 Bilateral contracts with industry

9.1.1 Louis‌ Vuitton/ENS chair on artificial intelligence

Participants: Jean Ponce‌.

The scientific chair Louis Vuitton - École normale supérieure in Artificial Intelligence was created in 2017 and inaugurated on April 12, 2018 by the ENS Director Marc Mézard and the LV CEO Michael Burke. The goal of the chair is to establish a close collaboration between LV and ENS in the area of Artificial Intelligence. The chair enjoys a generous annual contribution of 200K Euros provided by LV in support of research activities in statistical learning and computer vision. In particular, the chair supports the costs of researchers, students, missions, and computational resources, as well as seminars and meetings, including the two-day meeting organized annually by LV and ENS. During 2020, ENS and LV organized several joint meetings with the participation of researchers from the SIERRA and WILLOW teams. The chair has also supported the hiring of one PhD student in the WILLOW team, missions to conferences and international research labs, as well as data collection for research projects. In 2020 the chair was extended for a further three-year period until 2023. We are planning to start a CIFRE PhD of François Gardères together with Louis Vuitton in 2023.

9.1.2 Casino/ENS chair on algorithmic and machine learning

Participants: Justin Carpentier‌.

The scientific chair Casino/ENS - École normale supérieure on algorithmic and machine learning was created in 2021. J. Carpentier is in charge of the robotics axis of this chair.

10‌ Partnerships and cooperations

10.1‌ International research visitors

10.1.1‌‌ Visits of international scientists

Other international visits to‌ the team

Marc Toussaint‌

Status:
Professor
Institution of‌‌ origin:
TU Berlin
Country:
Germany
Dates:
1 month‌
Context of the visit:‌
collaboration
Mobility program/type of‌‌ mobility:
research stay

Mike Tarr

Status:
Professor
Institution‌ of origin:
Carnegie Mellon‌ University
Country:
US
Dates:‌‌
1 month
Context of the visit:
collaboration
Mobility‌ program/type of mobility:
sabbatical‌

Baohe Zhang

Status:
PhD‌‌ student
Institution of origin:
University of Freiburg
Country:‌
Germany
Dates:
5 months‌
Context of the visit:‌‌
Collaboration on world modeling for robotics
Mobility program/type‌ of mobility:
research stay‌

10.2 European initiatives

10.2.1‌‌ Horizon Europe

AGIMUS

AGIMUS project on cordis.europa.eu

Title:‌
Next generation of AI-powered‌ robotics for agile production‌‌
Duration:
From October 1, 2022 to September 30,‌ 2026
Partners:
- INSTITUT NATIONAL‌ DE RECHERCHE EN INFORMATIQUE‌‌ ET AUTOMATIQUE (INRIA), France
- AIRBUS, France
- KLEEMANN HELLAS‌ SA (KLEEMANN HELLAS SA),‌ Greece
- PAL ROBOTICS SLU‌‌ (PAL ROBOTICS), Spain
- Q-PLAN INTERNATIONAL ADVISORS PC (Q-PLAN‌ INTERNATIONAL), Greece
- PAL FRANCE,‌ France
- THIMM OBALY, K.S.,‌‌ Czechia
- CESKE VYSOKE UCENI TECHNICKE V PRAZE (CVUT),‌ Czechia
- CENTRE NATIONAL DE‌ LA RECHERCHE SCIENTIFIQUE CNRS‌‌ (CNRS), France
Inria contact:
Justin Carpentier
Coordinator:
Summary:‌
AGIMUS aims to deliver an open-source breakthrough innovation in AI-powered agile production, introducing solutions that push the limits of perception, planning, and control in robotics, enabling general-purpose robots to be quick to set up, autonomous, and easy to adapt to changes in the manufacturing process. To achieve such agile production, AGIMUS leverages cutting-edge technologies and goes beyond the state of the art to equip current mobile manipulators with a combination of (i) an advanced task and motion planner that can learn from video demonstrations available online; (ii) optimal control policies obtained from advances in reinforcement learning based on efficient differentiable physics simulations of the manufacturing process; as well as (iii) advanced perception algorithms able to handle objects and situations unseen during initial training. Along the way, optimization of energy efficiency and the use of 5G technology will support further pushing the limits of autonomy. The AGIMUS solutions and their impact will be demonstrated and thoroughly stress-tested in 3 testing zones, as well as 3 industrial pilots in Europe, under numerous diverse real-world case studies and scenarios (different tools, environments, processes, etc.). In every step, and from the very beginning, AGIMUS will go beyond current norms and involve a wide range of stakeholders, starting from the production line itself, to identify the essential ethical-by-design principles and guidelines that can maximise acceptance and impact.

ARTIFACT

ARTIFACT project on cordis.europa.eu‌

Title:
The Artificial Motion‌ Factory
Duration:
From September‌‌ 1, 2025 to August 31, 2030
Partners:
- INSTITUT‌ NATIONAL DE RECHERCHE EN‌ INFORMATIQUE ET AUTOMATIQUE (INRIA),‌‌ France
Inria contact:
Justin‌ Carpentier
Coordinator:
Summary:

Today's robots are confined to tightly controlled environments: even the complex choreographies that the Atlas humanoid flawlessly executes rely heavily on handcrafted control strategies and detailed workspace models, with little room for sensing. To put it bluntly, robots are nowhere near the level of agility, dexterity, and even less so autonomy, robustness, and safety required for their deployment in the wild alongside people.

The tenet of ARTIFACT is that‌ the key to an actual revolution will come‌ from the algorithmic foundations of artificial motion intelligence,‌ an AI challenged from the start to interact‌ physically with dynamic environments and, ultimately, people. To‌ do so, we will break away from the‌ dichotomy between optimal control, where the role of‌ perception is traditionally limited to an early state‌ estimation stage, and reinforcement learning, where control policies‌ are typically learned model-free with no guarantee to‌ cope with the curse of dimensionality.

In ARTIFACT,‌ we will devise a unified, structured, modular, and‌ learnable control architecture for providing robots with advanced‌ decision-making capabilities to solve complex tasks and face‌ new interactions as they experience the world. It‌ will leverage the notion of differentiable programming at‌ all scales to enable robots to (i) capture‌ models of their interactions directly from a sound‌ combination of sensor data and first principles from‌ physics, (ii) autonomously discover new complex gestures and‌ movements leveraging their past experiences, and (iii) learn‌ embodied representations to control their interactions finely and‌ reason about the physical world. It will be‌ implemented in open-source software and shown in real-world‌ and challenging scenarios requiring fine dexterity and high‌ agility. Altogether, these contributions will be the key‌ enablers to enhance robot autonomy fundamentally, thus opening‌ the age of ubiquitous robots at the service‌ of mankind.

ExTRAORDiNary

ExTRAORDiNary project on cordis.europa.eu

Title:‌
Accelerating Differentiable Robot Dynamics Simulation for Advanced Control‌
Duration:
From May 1, 2025 to April 30,‌ 2027
Partners:
- INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE‌ ET AUTOMATIQUE (INRIA), France
Inria contact:
Justin Carpentier‌
Coordinator:
Summary:

Differentiable robot dynamics simulation is a crucial enabler of advanced robot control. It is at the heart of both model predictive control (MPC) and learning-based approaches (e.g. reinforcement learning [RL]), which are among the most successful and actively researched robot control algorithms. Increased usage of the computationally demanding MPC/RL controllers has led to a growing need for efficient dynamics simulators. However, existing simulators internally use inefficient high-complexity (worst-case cubic) constrained dynamics algorithms (CDA) and are often inefficiently implemented, making them several times slower than a fast simulator like Pinocchio.

Addressing these‌ concerns, we will accelerate the differentiable simulation through‌ three complementary strategies. We will 1) leverage low-complexity‌ CDAs, 2) use Pinocchio's proven efficient software design‌ patterns and explore further acceleration via code generation‌ computations, and 3) derive efficient algorithms for differentiating‌ through contact simulation.

Furthermore, our simulator will solve‌ the nonlinear complementarity problem of frictional contact without making physics-compromising relaxations like‌ existing simulators and will‌ be publicly available as‌‌ part of the widely used open-source Pinocchio library.‌ By adding key enhancements‌ to Pinocchio, we will‌‌ make it a viable alternative to the inefficient,‌ but feature-rich software simulators.‌ The visibility, impact and‌‌ usability of our simulator will be enhanced by‌ addressing some low-hanging fruits‌ in MPC, RL and‌‌ physics identification applications.

This project's contributions will not only pave the way towards fast whole-body controllers and faster and more sustainable RL training (important at a time of surging RL research activity), but will also impact adjacent fields like biomechanics and computer graphics in the long term.

LiftMeUp

LiftMeUp‌‌ project on cordis.europa.eu

Title:
LiftMeUp: Globally optimal algorithms‌ for dexterous manipulation and‌ locomotion
Duration:
From May‌‌ 1, 2025 to April 30, 2027
Partners:
- INSTITUT‌ NATIONAL DE RECHERCHE EN‌ INFORMATIQUE ET AUTOMATIQUE (INRIA),‌‌ France
Inria contact:
Justin Carpentier
Coordinator:
Summary:

Robots‌ bear the potential to‌ help solve the world’s‌‌ pressing problems by enabling and scaling up operations‌ beyond human capacities. To‌ successfully manipulate objects and‌‌ perform reliable locomotion, robots require adequate models and‌ solvers. Traditionally, physics-based models‌ and iterative solvers are‌‌ used, and obtaining reliable solutions requires significant effort‌ in model tuning and‌ heuristics for good convergence.‌‌ LiftMeUp’s objective is to combine data-driven modeling with‌ globally optimal solvers in‌ a unique way to‌‌ create an easy-to-use framework for the life-long operation‌ of robots in challenging‌ tasks. The result is‌‌ a transparent, sample-efficient alternative to the less interpretable‌ and resource-hungry deep-learning solutions‌ for robotics. Furthermore, LiftMeUp‌‌ builds on providing certifiably optimal methods with important‌ consequences for safety and‌ efficiency, as opposed to‌‌ deep learning and local solvers, where different initializations‌ can lead to entirely‌ different solutions.

LiftMeUp is‌‌ carried out at WILLOW, Inria Paris, known for‌ cutting-edge control and locomotion‌ research, and has three‌‌ stages: first, combining concepts from Koopman theory, polynomial‌ optimization, and kernel methods,‌ lifting functions are inferred‌‌ from data and integrated into globally optimal methods‌ for state estimation and‌ control. Second, different models‌‌ are optimally combined, leading to a modular framework‌ that can be incrementally‌ updated online. Lastly, these‌‌ novel algorithms are implemented on hardware to solve‌ real-world locomotion and dexterous‌ manipulation tasks.
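The idea of lifting functions from Koopman theory can be illustrated with a textbook-style toy system. This is a sketch under stated assumptions, not the LiftMeUp method: the dynamics, the coefficients `A`, `B`, `C`, and the lifting `lift` are chosen by hand here, whereas the project proposes to infer lifting functions from data. The point is only that a nonlinear system can become exactly linear in well-chosen lifted coordinates.

```python
# Toy nonlinear dynamics (assumed for illustration):
#   x1' = A * x1
#   x2' = B * x2 + C * x1**2
A, B, C = 0.9, 0.5, 0.2


def step(x):
    """One step of the nonlinear system."""
    x1, x2 = x
    return (A * x1, B * x2 + C * x1 ** 2)


def lift(x):
    """Hand-chosen lifting: append the monomial x1**2 to the state."""
    x1, x2 = x
    return (x1, x2, x1 ** 2)


# In lifted coordinates z = (x1, x2, x1**2) the dynamics are linear: z' = K z.
K = (
    (A, 0.0, 0.0),
    (0.0, B, C),
    (0.0, 0.0, A ** 2),
)


def apply_linear(matrix, vector):
    """Matrix-vector product on plain tuples."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)


def max_lifting_error(x):
    """Largest gap between K @ lift(x) and lift(step(x))."""
    predicted = apply_linear(K, lift(x))
    actual = lift(step(x))
    return max(abs(p - a) for p, a in zip(predicted, actual))


error = max_lifting_error((0.5, -1.0))  # zero up to floating-point rounding
```

Once the dynamics are linear in the lifted space, estimation and control problems can be posed with linear models, which is what makes such liftings attractive for globally optimal solvers.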

This framework will have an important scientific impact by creating novel connections between global optimization and machine learning, enabling the use of principled rather than heuristic solvers in a broad range of applications in robotics and beyond. It will entail energy and time savings for the economy, and the use of sample-efficient and transparent models will democratize technology and build trust.

10.3 National initiatives

10.3.1‌ PRAIRIE

Participants: Justin Carpentier‌‌, Jean Ponce, Cordelia Schmid.

The‌ Prairie Institute (PaRis AI‌ Research InstitutE) is one‌‌ of the four French Institutes for Interdisciplinary Artificial‌ Intelligence Research (3IA), which‌ were created as part‌‌ of the national French initiative on AI announced‌ by President Emmanuel Macron‌ on May 29, 2018.‌‌ It brings together five‌ academic partners (CNRS, Inria, Institut Pasteur, PSL University,‌ and University of Paris) as well as 17‌ industrial partners, large corporations which are major players‌ in AI at the French, European and international‌ levels, as well as 45 Chair holders, including‌ three of the members of WILLOW (Carpentier, Ponce,‌ Schmid). Ponce is the scientific director of PRAIRIE.‌

10.3.2 PR[AI]RIE-PSAI

Participants: Justin Carpentier, Stephane Caron‌, Shizhe Chen, Jean Ponce, Cordelia‌ Schmid.

PR[AI]RIE-PSAI (Paris School of AI) is one of the 9 French AI-Clusters financed by France 2030 for a total of 75m€ over 5 years (2024-2029). Created in 2019 by CNRS, Inria, Institut Pasteur, PSL University, Université de Paris Cité, and a club of industrial partners, PaRis Artificial Intelligence Research InstitutE (PR[AI]RIE) was one of the 4 Interdisciplinary Institutes for AI research set up as part of the national strategy for Artificial Intelligence announced by the President of the French Republic in March 2018. Starting in 2024, it has evolved to become the PR[AI]RIE Paris School of AI (PR[AI]RIE-PSAI) and to cover the triptych of research, training, and innovation. PR[AI]RIE-PSAI's activities are supported by 125 internationally renowned scientists, specialists in AI with diverse fields of application, such as biology, health, physics, transport or the environment, working in collaboration with public and private actors in these sectors. It includes the five faculty members of WILLOW.

10.3.3 VideoPredict:‌ Predicting future video content

Participants: Cordelia Schmid,‌ Jean Ponce.

Predicting future video content is a challenging problem with high potential impact in downstream tasks such as self-driving cars and robotics, but also much promise for the learning process itself, from self-supervised learning to data augmentation. Existing approaches range from predicting future actions with semantic labels to creating realistic renderings of future frames. Most of them use straight predictions from convolutional features of previous frames. We propose instead to model the causal effects involved in the video formation process, and to disentangle motion and appearance factors. This will result in better prediction, but also, and maybe more importantly, in a better, more structured understanding of the video content, leading to explainable and interpretable results, and eventually to more trustworthy learning systems. The German and French partners are, respectively, experts in machine learning and computer vision, with complementary research threads in causality and disentangled data models on the one hand, and video understanding and action recognition on the other hand, that are ideally suited for this collaborative project.

10.3.4 PEPR Organic Robotics

Participants: Justin Carpentier‌, Stephane Caron, Megane Millan, Umit‌ Bora Gokbakan, Etienne Menager.

The PEPR O2R "Organic Robotics" aims to initiate a change in robotics to create a new generation of robots that interact fluidly and naturally with users, adapt socially in their interactions, and accompany the technological transitions of societies by providing adapted, responsive and reliable services to citizens. Within this national program, WILLOW is involved in Structuring Action 2 (Robot motion with physical interactions and social adaptation) led by Philippe Souères at LAAS-CNRS, and Structuring Action 4 (Modelling, Simulation, Multi-scale, and Biomechanics) led by Jérémie Dequidt at Inria DEFROST. J. Carpentier is also a member of the executive committee of the PEPR.

10.3.5 ARTIFACT (ANR Tremplin): The Artificial Motion Factory

Participants: Justin‌‌ Carpentier, Megane Millan, Franki Nguimatsia Tiofack‌.

Modern robots remain confined to tightly controlled environments, and even the complex choreographies that Boston Dynamics humanoids execute flawlessly rely heavily on motion capture, handcrafted control strategies, and detailed workspace models, which leave little room for perception. Put plainly, robots are far from reaching the level of agility and dexterity, let alone the autonomy, robustness, and safety, required for their deployment "in the wild" alongside people. A leap forward in these capabilities is needed for them to deliver on their promises and truly leave the laboratory. Our tenet is that the key to this revolution is the development of the theoretical and algorithmic foundations of a genuine artificial motion intelligence, an AI that must meet the additional challenge of interacting physically with dynamically changing environments and, ultimately, with people. We will break away from the dichotomy between optimal control, where the role of perception is traditionally limited to an early state estimation stage, and reinforcement learning, where control policies are typically learned model-free, with no guarantee of coping with the curse of dimensionality. Concretely, we will use the Koopman formalism of complex dynamical systems to learn sensorimotor models and the corresponding control strategies from sensor data. We will develop powerful methods to learn, control, and share a dictionary of sensorimotor synergies across tasks, echoing those used by the human central nervous system in everyday tasks and accelerating the acquisition of new skills. We will leverage the composition of sensorimotor strategies and neural-network-powered tree search strategies to optimally plan robot motions under dynamic observation constraints. The proposed framework will be implemented in new differentiable programming software architectures and demonstrated on several locomotion and manipulation tasks, both in simulation and on real robots.

10.3.6 NIMBLE‌‌ (ANR JCJC): Inexact optimization for robot control

Participants:‌ Justin Carpentier, Stephane‌ Caron, Etienne Arlaud‌‌, Jean Ponce, Oumayma Bounou, Joris‌ Vaillant, Wilson Jallet‌, Frederike Dumbgen.‌‌

The limited agility and dexterity of modern robots prevent them from being deployed outside of laboratories, let alone outside of factories. With NIMBLE, we want to point to the classical sense-plan-act design pattern, widely adopted in robotics, as one of the main limiting factors. We propose to replace this three-part control paradigm by learning, from real robot experiments, a predictive model of the robot's sensorimotor capabilities. This sensorimotor model will notably be exploited to make complex decisions that generalize to unforeseen situations directly from sensor measurements. While NIMBLE's innovation takes its roots in the observation of the organization of human motor control, it is grounded in advanced and principled mathematical methodologies, in particular the Koopman operator sitting on top of (deep) learning, and exploits our recognized expertise in robot modeling, optimal control and machine learning for real robots. It will notably enable complex tasks to be defined and executed directly in the sensor space. The success of NIMBLE will be assessed by clear benchmarks in quadrupedal locomotion able to optimally adapt to unstructured terrains and in mobile manipulation for opening unknown doors using a sound combination of force and visual feedback.

10.3.7 INEXACTE (ANR PRCE): Inexact optimization for‌ robot control

Participants: Justin Carpentier, Stephane Caron‌, Etienne Arlaud, Antoine Bambade, Joris‌ Vaillant, Wilson Jallet.

Robotic systems are‌ expected to take a large place in tomorrow’s‌ society, far beyond current industrial robots in tightly‌ controlled factory environments, with large impacts in terms‌ of safety, health at work, comfort and productivity.‌ The motion of robots is typically designed and‌ controlled by specifying numerical objectives and constraints on‌ what they must do, and within which limits.‌ These specifications often conflict, and the actual control‌ must be computed to satisfy all of them‌ in the best possible way. This is naturally‌ achieved by solving a numerical optimization problem. Such‌ problems are often small enough in robotics that‌ they can be solved exactly in theory, but‌ they are always based on models, and by‌ definition, models reflect reality imperfectly, even more so‌ as we get away from tightly controlled (factory)‌ environments.

We propose a complete change of paradigm, to acknowledge that we actually solve inaccurate optimization problems that provide inaccurate solutions by construction, and explore the following two hypotheses: (H1) we can obtain the exact same performance with imprecise numerical solutions, (H2) we can obtain these imprecise numerical solutions using less costly numerical methods, which can be computed faster, using less demanding hardware. To the best of our knowledge, these questions have barely been explored and INEXACTE will provide the first comprehensive exploration of this topic.

Our short-term‌ ambition is to significantly lower the computational requirements‌ for solving control problems, taking advantage of the‌ imprecisions inherent to robotics control to compute appropriate‌ solutions faster. But ultimately, our long-term ambition is‌ to design less fragile, less expensive and less‌ polluting robots, since being less dependent on precise‌ models can make us less dependent on precise‌ and therefore complex, fragile, expensive and resource-demanding mechatronics.‌

10.3.8 3D-GEM (ANR JCJC): Learning Generalizable 3D-based Robotic‌ Manipulation Policies

Participants: Shizhe Chen, Justin Carpentier, Stephane Caron,‌ Cordelia Schmid, Jean‌ Ponce.

Robotic manipulation‌‌ in unstructured environments is a long-term goal, with‌ the potential for significant‌ societal and economic impacts‌‌ such as in manufacturing and healthcare. However, current‌ approaches suffer from significant‌ limitations in generalization to‌‌ novel environments, objects and tasks, which are essential‌ for real-world applications. Most‌ learning-based methods are trained‌‌ and evaluated on a narrow range of tasks‌ - typically basic pick-and-place‌ skills, and focus on‌‌ 2D images, lacking crucial 3D understanding. The 3D-GEM‌ project aims to develop‌ cutting-edge robotic manipulation systems‌‌ by leveraging recent breakthroughs in artificial intelligence, particularly‌ large language models and‌ vision foundation models, to‌‌ build 3D-based robotic manipulation foundation models. This initiative‌ will establish a modular‌ framework to tackle critical‌‌ challenges, including data scarcity, generalization, dexterity, and efficiency.‌ The project involves three‌ key thrusts: (1) significantly‌‌ enhancing the scale and quality of robot datasets;‌ (2) advancing 3D embodied‌ perception and task planning‌‌ for comprehending complex 3D scenes and generating high-level‌ grounded plans; (3) learning‌ generalist 3D motion planning‌‌ policies using multimodal sensors and model predictive control.‌ These high-level and low-level‌ modules will function in‌‌ a closed-loop system to enable efficient task execution‌ across diverse scenarios, ultimately‌ improving the versatility and‌‌ effectiveness of robotic systems.

10.4 Regional initiatives

10.4.1‌ AI4IDF

Participants: Justin Carpentier‌, Jean Ponce,‌‌ Etienne Arlaud, Joris Vaillant, Pierre-Guillaume Raverdy‌.

Île-de-France is home to the world's largest mathematics community and several of France's largest computer science laboratories, as well as a dense industrial fabric in artificial intelligence.

In this extremely rich context, the‌ four main Artificial Intelligence‌ (AI) institutes - DATAIA,‌‌ Hi! PARIS, PRAIRIE and SCAI - propose to‌ create an alliance to‌ structure and animate the‌‌ community, and to offer industrial and international partners‌ a unified vision of‌ the exceptional forces at‌‌ work.

11 Dissemination

Participants: Ajay Sathya, Cordelia‌ Schmid, Elliot Vincent‌, Etienne Menager,‌‌ Etienne Arlaud, Francois Garderes, Fabian Schramm‌, Frederike Dumbgen,‌ Franki Nguimatsia Tiofack,‌‌ Gabriel Fiastre, Louis Montaut, Justin Carpentier‌, Jean Ponce,‌ Shizhe Chen, Shiyao‌‌ Li, Sara Pieri, Theotime Le Hellard‌, Matthieu Futeral-Peter,‌ Paul Pacaud, Roland‌‌ Andrews, Thomas Chabal, Valentin Tordjman–Levavasseur,‌ Wilson Jallet, Umit‌ Bora Gokbakan, Zeeshan‌‌ Khan.

11.1 Promoting scientific activities

11.1.1 Scientific‌ events: organisation

Member of‌ the organizing committees

Ellis Workshop on Computer Vision & Machine Learning, Bad Teinach, April 2025 (C. Schmid)
CVPR 2025 workshop on‌‌ Generalization in Robotics Manipulation Workshop and Challenges, Nashville,‌ June 2025 (S. Chen,‌ C. Schmid)
Video AI‌‌ Symposium, Paris, September 2025 (C. Schmid)
CoRL 2025‌ workshop on Open-Source Hardware‌ in the Era of‌‌ Robot Learning, Seoul, September 2025 (S. Caron)
Workshop‌ on Diverse Optimization and‌ Exploration, Paris, November 2025‌‌ (J. Carpentier)

11.1.2 Scientific events: selection

Chair of‌ conference program committees

IEEE/CVF‌ Computer Vision and Pattern‌‌ Recognition Conference (CVPR) (J.‌ Ponce, C. Schmid, S. Chen)
International Conference on‌ Computer Vision (ICCV) (J. Ponce, C. Schmid)
European‌ Conference on Computer Vision (ECCV) (C. Schmid, S.‌ Chen)
International Conference on Machine Learning (ICML) (C.‌ Schmid, S. Chen)
International Conference on Learning Representations‌ (ICLR) (C. Schmid, S. Chen)
Association for the‌ Advancement of Artificial Intelligence (AAAI) (S. Chen)
RSS‌ Pioneers (A. Sathya, F. Dumbgen)

Member of the‌ conference program committees

Associate editor for the Humanoids‌ 2025 conference (S. Caron, A. Sathya)
Associate editor‌ for the IROS 2025 conference (A. Sathya)
Associate‌ editor for the ICRA 2026 conference (A. Sathya)‌

Reviewer

IEEE-RAS International Conference on Robotics and Automation (ICRA) (S. Caron, Ü. B. Gökbakan, F. N. Tiofack, T. Le Hellard, F. Schramm, A. Sathya, S. Chen, T. Chabal)
IEEE-RAS International Conference on‌ Humanoid Robots (Humanoids) (S. Caron)
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (S. Caron, Ü. B. Gökbakan, F. Schramm, S. Li, L. Montaut, S. Chen)
International Conference on Learning Representations (ICLR) (T. Le Hellard, S. Pieri)
Annual Meeting‌ of the Association for Computational Linguistics (ACL) (S.‌ Pieri)
Conference on Neural Information Processing Systems (NeurIPS)‌ (S. Pieri, Z. Khan)
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (S. Pieri, S. Li, Z. Khan)
International Conference on 3D‌ Vision (3DV) (S. Li)
International Conference on Machine‌ Learning (ICML) (Z. Khan)
Robotics: Science and Systems‌ (RSS) (L. Montaut)
IEEE International Conference on Automation‌ Science and Engineering (CASE) (A. Sathya)
International Conference‌ on Computational Linguistics (COLING) (S. Chen)
Conference on‌ Language Modeling (COLM) (S. Chen)
Conference on Robot‌ Learning (CoRL) (S. Chen)
International Conference on Computer‌ Vision (ICCV) (S. Chen, T. Chabal)

11.1.3 Journal‌

Member of the editorial boards

Associate editor, IEEE‌ Transactions on Pattern Analysis and Machine Intelligence (TPAMI)‌ (S. Chen)
Associate editor, IEEE Transactions on Robotics (T-RO) (J. Carpentier)
Associate editor, IEEE Robotics and Automation Letters (RA-L) (J. Carpentier)

Reviewer - reviewing‌ activities

IEEE Transactions on Robotics (T-RO) (J. Carpentier,‌ S. Caron, L. Montaut, A. Sathya, S. Chen)‌
IEEE Robotics and Automation Letters (RA-L) (S. Caron,‌ Ü. B. Gökbakan, F. Schramm, E. Menager, L.‌ Montaut, A. Sathya, S. Chen, T. Chabal)
International‌ Journal of Robotics Research (IJRR) (E. Menager, A.‌ Sathya)
IEEE-RAS International Conference on Soft Robotics (E.‌ Menager)
IEEE Transactions on Automation Science and Engineering‌ (T-ASE) (A. Sathya)
International Journal of Computer Vision‌ (IJCV) (S. Chen)

11.1.4 Invited talks

J. Carpentier, Conseil Scientifique d'Inria, Paris, December 2025.
J. Carpentier,‌ Workshop on Diverse Optimization for Robotics, Paris, November‌ 2025.
J. Carpentier, 4th International Workshop on AI‌ for Robotics, NAVER LABS Europe, Grenoble, November 2025.‌
J. Carpentier, X-IA event, Paris, November 2025.
J. Carpentier, Table ronde IA et Robotique, Cap Digital, Paris, November 2025.
J. Carpentier, IEEE-RAS‌ Polish Chapter, Remote, September 2025.
J. Carpentier, PAISS‌ Summer School, Grenoble, September 2025.
J. Carpentier, Table‌ ronde Robotique et IA VivaTech, Paris, June 2025.‌
J. Carpentier, RoboSoft workshop, Lausanne, April 2025.
J. Carpentier, AI Summit, February 2025.
C. Schmid, 4th‌ International Workshop on AI‌‌ for Robotics, NAVER LABS Europe, Grenoble, November 2025.‌
C. Schmid, Workshop on‌ Multimodal Representation and Retrieval,‌‌ in conjunction with ICCV’25, October 2025.
C. Schmid,‌ Festvortrag (ceremonial lecture) at‌ Leopoldina annual meeting, Halle,‌‌ September 2025.
C. Schmid, Invited speaker at Video‌ AI Symposium, Paris, September‌ 2025.
C. Schmid, Keynote‌‌ at Building Bridge Conference, Dresden, 2025.
C. Schmid,‌ Invited speaker at Workshop‌ on Pixel-level Vision Foundation‌‌ Models, in conjunction with CVPR’25, June 2025.
C.‌ Schmid, Invited speaker at‌ Workshop on Multimodal Algorithmic‌‌ Reasoning, in conjunction with CVPR’25, June 2025.
C.‌ Schmid, Invited speaker at‌ Workshop on ScanNet++ Novel‌‌ View Synthesis and 3D Semantic Understanding, in conjunction‌ with CVPR’25, June 2025.‌
C. Schmid, Invited speaker‌‌ at Workshop on Video Large Language Models, in‌ conjunction with CVPR’25, June‌ 2025.
C. Schmid, Invited‌‌ speaker at Workshop on Computer Vision in the‌ Wild, in conjunction with‌ CVPR’25, June 2025.
C.‌‌ Schmid, Seminar at Applied and Theoretical Aspects of‌ Robot Intelligence Lab, TUM,‌ Munich, December 2025.
C.‌‌ Schmid, Presentation at Workshop on Diverse Optimization and‌ Exploration, Inria, Paris, November‌ 2025.
C. Schmid, Presentation‌‌ at Malik Fest, University of California, Berkeley, October‌ 2025.
C. Schmid, Presentation‌ at Ellis workshop, Bad‌‌ Teinach, April 2025.
J. Ponce, Forum on AI‌ Frontiers 2025, Seoul, Korea,‌ October 27.
S. Caron,‌‌ CoRL 2025 workshop on Open-Source Hardware in the‌ Era of Robot Learning,‌ September 2025.
S. Caron,‌‌ Harada Lab, The University of Osaka, Japan, October‌ 2025.
S. Caron, LAAS-CNRS,‌ October 2025.
A. Sathya,‌‌ Robotics, Optimization, and Assistive Mobility (ROAM) Lab at‌ University of Notre Dame,‌ USA, July 2025.
A.‌‌ Sathya, EEE Department, IIT-Guwahati, Guwahati, India, August 2025.‌
A. Sathya, Gepetto Team,‌ LAAS-CNRS, Toulouse, October 2025.‌‌
S. Chen, GDR IASIS workshop on Deformable Object‌ Modelling Trends: from Perception‌ to Applications, Paris, April‌‌ 2025.
S. Chen, Invited speaker at Workshop on‌ Computer Vision in the‌ Wild, in conjunction with‌‌ CVPR’25, June 2025.
S. Chen, Demi-heure de science,‌ Inria Paris, July 2025.‌
S. Chen, Korea University,‌‌ October 2025.
S. Chen, THOTH team, Inria Grenoble,‌ November 2025.
S. Chen,‌ Workshop on Foundation Models‌‌ in Robotics, Lyon, November 2025.

11.2 Teaching -‌ Supervision - Juries -‌ Educational and pedagogical outreach‌‌

11.2.1 Teaching

Course at Master MVA: Robotics (S.‌ Caron, J. Carpentier, A.‌ Sathya as teaching assistant)‌‌
Course at ENS-PSL: Planification de mouvement en robotique‌ et en animation graphique‌ (S. Caron)
Course at‌‌ Dauphine-PSL: Formation Chef de projet IA (S. Caron)‌
Lecture at Mines Paris-PSL‌ (formation MAREVA): Reinforcement learning‌‌ for robotics (S. Caron)
Introduction to computer vision,‌ NYU, Spring 2025 (J.‌ Ponce)
Introduction to computer‌‌ vision, ENS-PSL, Fall 2025 (J. Ponce, G. Fiastre‌ as teaching assistant)
Computer‌ vision, Chefs de projets‌‌ IA, Société des Ingénieurs de l'Automobile, June 2025‌ (J. Ponce)
Computer vision,‌ Casino COMEX, We Are‌‌ School, May 2025 (J. Ponce)
Machine Learning and‌ Applications (MALAP), École Nationale‌ des Ponts et Chaussées,‌‌ IP Paris (S. Li‌ as teaching assistant)
Deep Reinforcement Learning lecture at‌ Master MVA: Deep Learning (S. Chen)

11.2.2 Supervision‌

PhD defenses

Zerui Chen, advised by C. Schmid‌ and S. Chen.
Ricardo Garcia-Pinel, advised by C.‌ Schmid and S. Chen.
Thomas Chabal, advised by‌ J. Ponce, C. Schmid and S. Chen.
Matthieu‌ Futeral-Peter, advised by R. Bawden (Inria ALMAnaCH), B.‌ Sagot (Inria ALMAnaCH) and C. Schmid.
Theo Bodrito,‌ advised by J. Ponce and J. Mairal (Inria‌ Grenoble).
Elliot Vincent, advised by J. Ponce and‌ M. Aubry (ENPC).

PhD students

Théotime Le Hellard,‌ started in Oct 2025, advised by J. Carpentier.‌
Franki Nguimatsia Tiofack, started in Jan 2025, advised‌ by J. Carpentier.
Roland Andrews, started in Oct‌ 2024, advised by J. Carpentier and A. Taylor‌ (Sierra).
Imen Mahdi (University of Freiburg), started in‌ Oct 2024, advised by C. Schmid and Abhinav‌ Valada (University of Freiburg).
Romain Seailles, started in‌ Sept 2024, advised by J. Ponce and J.‌ Mairal (Inria Grenoble).
Basile Terver (Meta), started in‌ Nov 2024, advised by J. Ponce and Y.‌ LeCun (Meta).
Shiyao Li (ENPC), started in Oct 2024, advised by S. Chen and V. Lepetit (ENPC).
Lucas Ventura, started in Oct 2022, advised‌ by C. Schmid and G. Varol (ENPC).
Fabian‌ Schramm, started in Feb 2023, advised by J.‌ Carpentier and N. Perrin-Gilbert (ISIR).
Zeeshan Khan, started‌ in Sept 2023, advised by S. Chen and‌ C. Schmid.
Gabriel Fiastre, started in Sept 2023,‌ advised by C. Schmid.
Ludovic de Matteis, started‌ in Oct 2023, advised by J. Carpentier and‌ N. Mansard (CNRS).
Ü. Bora Gökbakan, started in June 2024, advised by S. Caron and P. Souères (CNRS).
Sara Pieri, started in Oct 2024, advised by S. Chen, C. Schmid and J. Sivic (Czech Technical University).
Paul Pacaud, started in Oct 2024, advised by S. Chen and C. Schmid.
François Gardères, started in May 2023, advised by S. Chen and J. Ponce.
F. Porcher,‌ started in May 2025, advised by N. Carion‌ (Meta), K. Alahari (Inria Grenoble) and S. Chen.‌

11.2.3 Juries

PhD committee of Marc Duclusaud, University‌ of Bordeaux, France, December 2025 (S. Caron)
CRCN‌ recruitment jury for Inria Nancy, April 2025 (S.‌ Caron)
Timothée Darcet, Université Grenoble Alpes, July 2025‌ (C. Schmid)
Théo Cachet, Sorbonne université, Paris, June‌ 2025 (C. Schmid)
Kumar Ashutosh, prelim UT Austin,‌ May 2025 (C. Schmid)
Corentin Sautier, ENPC, October‌ 2025 (S. Chen)
Ivan Lopes, Mines Paris –‌ PSL, October 2025 (S. Chen)
Smail Ait Bouhsain,‌ LAAS-CNRS, April 2025 (J. Carpentier)

11.3 Popularization

11.3.1‌ Specific official responsibilities in science outreach structures

"Carte‌ blanche on AI" in Le Monde newspaper with‌ Isabelle Ryl, 6 times a year (J. Ponce)‌
Interview by Micode on Underscore_, available on YouTube, with over 310,000 views (J. Ponce)

11.3.2‌ Productions (articles, videos, podcasts, serious games, ...)

Popular‌ science video for the Algorea platform (reach: 10k‌ teachers, 1M students) in collaboration with France-IO (S.‌ Caron)

12 Scientific production

12.1 Major publications

1 Thomas Chabal, Shizhe Chen, Jean Ponce and Cordelia Schmid. FOM-Nav: Frontier-Object Maps for Object Goal Navigation. December 2025. HAL.
2 Shizhe Chen, Ricardo Garcia, Paul Pacaud and Cordelia Schmid. Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation. 2025. HAL, DOI.
3 Gabriel Fiastre, Antoine Yang and Cordelia Schmid. MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos. October 2025. HAL.
4 Ü. Bora Gökbakan, Frederike Dümbgen and Stéphane Caron. A Data-driven Contact Estimation Method for Wheeled-Biped Robots. January 2025. HAL.
5 Zeeshan Khan, Shizhe Chen and Cordelia Schmid. ComposeAnything: Composite Object Priors for Text-to-Image Generation. May 2025. HAL.
6 Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec and Justin Carpentier. Sobolev Diffusion Policy. July 2025. HAL.
7 Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert and Justin Carpentier. Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning. October 2025. HAL.
8 Paul Pacaud, Ricardo Garcia, Shizhe Chen and Cordelia Schmid. Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models. December 2025. HAL.
9 Fabian Schramm, Nicolas Perrin-Gilbert and Justin Carpentier. First-order Sobolev Reinforcement Learning. November 2025. HAL.
10 Lucas Ventura, Antoine Yang, Cordelia Schmid and Gül Varol. Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, March 2025. HAL.

12.2 Publications‌ of the year

International‌ journals

11 Antoine Bambade, Fabian Schramm, Sarah El Kazdadi, Stéphane Caron, Adrien Taylor and Justin Carpentier. PROXQP: an Efficient and Versatile Quadratic Programming Solver for Real-Time Robotics Applications and Beyond. IEEE Transactions on Robotics, June 2025. HAL, DOI.
12 Timothy D Barfoot, Connor Holmes and Frederike Dümbgen. Certifiably optimal rotation and pose estimation based on the Cayley map. The International Journal of Robotics Research, 2025. In press. HAL, DOI.
13 Maël Gallois, Mégane Millan, Nicolas Vignais, Sylvain Guégan, Marie Babel, Justin Carpentier and Charles Pontonnier. Human-Robot Co-Simulation method for upper limb assistive force calculation using polytopes. Multidisciplinary Biomechanics Journal, 2, 2025, 287–289. HAL, DOI.
14 Wilson Jallet, Antoine Bambade, Etienne Arlaud, Sarah El-Kazdadi, Nicolas Mansard and Justin Carpentier. PROXDDP: Proximal Constrained Trajectory Optimization. IEEE Transactions on Robotics, 41, March 2025, 2605-2624. HAL, DOI.
15 Armand Jordana, Sébastien Kleff, Avadesh Meduri, Justin Carpentier, Nicolas Mansard and Ludovic Righetti. Structure-Exploiting Sequential Quadratic Programming for Model-Predictive Control. IEEE Transactions on Robotics, August 2025. HAL, DOI.
16 Tanguy Navez, Etienne Ménager, Paul Chaillou, Olivier Goury, Alexandre Kruszewski and Christian Duriez. Modeling, Embedded Control and Design of Soft Robots using a Learned Condensed FEM Model. IEEE Transactions on Robotics, 41, March 2025, 2441-2459. HAL, DOI.

National journals

17 Armand Jordana, Sébastien Kleff, Arthur Haffemayer, Joaquim Ortiz-Haro, Justin Carpentier, Nicolas Mansard and Ludovic Righetti. Infinite-Horizon Value Function Approximation for Model Predictive Control. IEEE Robotics and Automation Letters, June 2025. HAL, DOI.

International peer-reviewed conferences

18 Virgile Batto, Ludovic de Matteis and Nicolas Mansard. Extended URDF: Accounting for parallel mechanism in robot description. RAAD 2025 - 34th International Conference on Robotics in Alpe-Adria-Danube Region, Belgrade, Serbia, June 2025. HAL.
19 Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois and Anne-Marie Lagrange. A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, IEEE, 2025, 1-15. HAL.
20 Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois and Anne-Marie Lagrange. Deep learning for exoplanet detection and characterization by direct imaging at high contrast. SF2A 2025 - Journées de la Société Française d’Astronomie & d’Astrophysique, Toulouse, France, 2025, 1-5. HAL.
21 Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois and Anne-Marie Lagrange. Joint statistical modeling and deep learning for exoplanet detection and characterization by direct imaging at high contrast. EPSC-DPS Joint Meeting 2025, Helsinki, Finland, 2025, 1-47. HAL.
22 Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois and Anne-Marie Lagrange. Modèle statistique apprenable de mélange de distributions et fusion de données multivariées pour l'imagerie d'exoplanètes. GRETSI 2025 - XXXe Colloque Francophone de Traitement du Signal et des Images, Strasbourg, France, August 2025, 1-4. HAL.
23 Thomas Chabal, Shizhe Chen, Jean Ponce and Cordelia Schmid. Online 3D Scene Reconstruction Using Neural Object Priors. 3DV 2025 - 12th International Conference on 3D Vision 2025, Singapore, Singapore, March 2025. HAL.
24 Matthieu Futeral, Cordelia Schmid, Benoît Sagot and Rachel Bawden. Towards Zero-Shot Multimodal Machine Translation. Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, United States, 2025, 761–778. HAL.
25 Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden and Benoît Sagot. mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus. ACL 2025 - Findings of the Association for Computational Linguistics, Vienna, Austria, July 2025, 3461–3494. HAL, DOI.
26 Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry and Loic Landrieu. CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, CVPR 2025 EarthVision Workshop - The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville (Tennessee), United States, 2025, 2194-2203. HAL.
27 Shiyao Li, Antoine Guédon, Clémentin Boittiaux, Shizhe Chen and Vincent Lepetit. NextBestPath: Efficient 3D Mapping of Unseen Environments. ICLR 2025 - Thirteenth International Conference on Learning Representations, Singapore, Singapore, April 2025. HAL.
28 Ludovic de Matteis, Virgile Batto, Justin Carpentier and Nicolas Mansard. Optimal Control of Walkers with Parallel Actuation. 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, October 2025. HAL.
29 Etienne Ménager, Louis Montaut, Quentin Le Lidec and Justin Carpentier. Differentiable Simulation of Soft Robots with Frictional Contacts. RoboSoft 2025 - 8th IEEE-RAS International Conference on Soft Robotics, Lausanne, Switzerland, April 2025. HAL.
30 Yann de Mont-Marin, Louis Montaut, Jean Ponce, Martial Hebert and Justin Carpentier. On the Conic Complementarity of Planar Contacts. IEEE International Conference on Robotics & Automation, Vienna, Austria, June 2026. HAL.
31 Lucas Ventura, Antoine Yang, Cordelia Schmid and Gül Varol. Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, March 2025. HAL.
32 Elliot Vincent, Mehraïl Saroufim, Jonathan Chemla, Yves Ubelmann, Philippe Marquis, Jean Ponce and Mathieu Aubry. Detecting Looted Archaeological Sites from Satellite Image Time Series. Proceedings of the Computer Vision and Pattern Recognition Conference, CVPR 2025 EarthVision Workshop - The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville (Tennessee), United States, 2025, 2296-2307. HAL, DOI.

Conferences without proceedings

33 François Gardères, Shizhe Chen, Camille-Sovanneary Gauthier and Jean Ponce. FACap: A Large-Scale Fashion Dataset for Fine-grained Composed Image Retrieval. GENNEXT@SIGIR 2025 - 1st Workshop on Next Generation of IR and Recommender Systems with Language Agents, Generative Models, and Conversational AI, Padova, Italy, July 2025. HAL.
34 Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert and Justin Carpentier. Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning. ICLR 2026 - The Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, April 2026. HAL.
35 Fabian Schramm, Pierre Fabre, Nicolas Perrin-Gilbert and Justin Carpentier. Reference-Free Sampling-Based Model Predictive Control. ICRA 2026 - IEEE International Conference on Robotics and Automation, Vienna, Austria, November 2025. HAL.

Edition (books, proceedings, special issue of‌ a journal)

36 Optimal transport unlocks end-to-end learning for single-molecule localization. ICLR 2026: The Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 2026. In press. HAL.

Doctoral dissertations and habilitation theses

37 Théo Bodrito. Deep learning for exoplanet detection in high contrast imaging. École normale supérieure - PSL, June 2025. HAL.
38 Thomas Chabal. Object-centric representations for sensing and planning in visually-guided robotics. École normale supérieure, September 2025. HAL.
39 Zerui Chen. Learning dexterous manipulation from 3D hand and object interaction. Université Paris sciences et lettres, May 2025. HAL.
40 Ricardo Garcia-Pinel. Learning Visuomotor Policies for Robotic Manipulation. École normale supérieure - PSL, June 2025. HAL.
41 Elliot Vincent. Analysis of satellite image time series for classification and change detection. École des Ponts ParisTech, May 2025. HAL.

Reports & preprints

42 Thomas Chabal, Shizhe Chen, Jean Ponce and Cordelia Schmid. FOM-Nav: Frontier-Object Maps for Object Goal Navigation. December 2025. HAL.
43 Shizhe Chen, Ricardo Garcia, Paul Pacaud and Cordelia Schmid. Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation. 2025. HAL, DOI.
44 Gabriel Fiastre, Antoine Yang and Cordelia Schmid. MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos. October 2025. HAL.
45 Ü. Bora Gökbakan, Frederike Dümbgen and Stéphane Caron. A Data-driven Contact Estimation Method for Wheeled-Biped Robots. January 2025. HAL.
46 Antoine Groudiev, Fabian Schramm, Eloïse Berthier, Justin Carpentier and Frederike Dümbgen. KernelSOS for Global Sampling-Based Optimal Control and Estimation via Semidefinite Programming. July 2025. HAL.
47 Zeeshan Khan, Shizhe Chen and Cordelia Schmid. ComposeAnything: Composite Object Priors for Text-to-Image Generation. May 2025. HAL.
48 Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec and Justin Carpentier. Sobolev Diffusion Policy. July 2025. HAL.
49 Quentin Le Lidec, Louis Montaut, Yann de Mont-Marin, Fabian Schramm and Justin Carpentier. End-to-End and Highly-Efficient Differentiable Simulation for Robotics. May 2025. HAL.
50 Victor Lutz, Ludovic de Matteis, Virgile Batto and Nicolas Mansard. Control of Humanoid Robots with Parallel Mechanisms using Differential Actuation Models. March 2025. HAL.
51 Etienne Ménager, Antoine Bambade, Wilson Jallet, Alberto de Marchi and Justin Carpentier. Contact-Implicit Inverse Dynamics. August 2025. HAL.
52 Paul Pacaud, Ricardo Garcia, Shizhe Chen and Cordelia Schmid. Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models. December 2025. HAL.
53 Jean Ponce, Basile Terver, Martial Hebert and Michael Arbel. Dual Perspectives on Non-Contrastive Self-Supervised Learning. February 2026. HAL.
54 Ajay Suresha Sathya and Justin Carpentier. Constrained Articulated Body Algorithms for Closed-Loop Mechanisms. December 2025. HAL.
55 Fabian Schramm, Nicolas Perrin-Gilbert and Justin Carpentier. First-order Sobolev Reinforcement Learning. November 2025. HAL.
56 Fabian Schramm, Franki Nguimatsia Tiofack, Nicolas Perrin-Gilbert, Marc Toussaint and Justin Carpentier. Variance-Reduced Model Predictive Path Integral via Quadratic Model Approximation. February 2026. HAL.
57 Ajay Suresha Sathya, Louis Montaut, Yann de Mont-Marin and Justin Carpentier. Matrix-Free Delassus Operations: Scalable and Memory-Efficient Algorithms. January 2026. HAL.
58 Valentin Tordjman--Levavasseur and Stéphane Caron. Collision avoidance from monocular vision trained with novel view synthesis. March 2025. HAL.

Other scientific publications

59 Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois and Anne-Marie Lagrange. Deep learning for exoplanet detection and characterization by direct imaging at high contrast. JDLS 2025: Troisième Journée Deep Learning For Science, Paris, France, 2025, 1-1. HAL.
60 Etienne Ménager and Justin Carpentier. Frictional Contact Solving for Material Point Method. February 2026. HAL.

WILLOW - 2025

WILLOW - 2025

2025Activity​‌﻿﻿ reportProject-TeamWILLOW

Keywords

Computer Science﻿​​﻿ and Digital Science

Other﻿﻿﻿‌ Research Topics and Application﻿‌​‌ Domains

1​​​‌ Team members, visitors, external﻿﻿﻿‌ collaborators

Research Scientists

Post-Doctoral Fellows​​​‌

PhD Students

Technical Staff

Interns and​‌﻿﻿ Apprentices

Administrative Assistant

Visiting Scientists

External Collaborators​​﻿﻿

2 Overall objectives​​​‌

2.1 Statement

3 Research​​﻿﻿ program

3.1 Visual recognition​​​‌ and reconstruction of images﻿​﻿﻿ and videos

3.2 Learning embodied representations​‌﻿﻿

3.3 Image restoration​‌﻿﻿ and enhancement

4 Application​​​‌ domains

4.1﻿‌​‌ Automated visual assistants

4.2 Robotics﻿‌​‌

4.3 Image restoration﻿﻿﻿‌

5﻿​﻿﻿ Social and environmental responsibility​‌﻿﻿

6​​﻿﻿ Highlights of the year​​​‌

6.1 Awards

7​​​‌ Latest software developments, platforms,﻿﻿﻿‌ open data

7.1 Latest﻿‌​‌ software developments

7.1.1 alignsdf﻿​​﻿

7.1.2 BLERC

7.1.3﻿‌​‌ BurstSR

7.1.4﻿﻿﻿‌ FrozenBiLM

7.1.5​​​‌ hiveformer

7.1.6​​​‌ HM3DAutoVLN

7.1.7﻿​​﻿ Just Ask: Learning to​​​‌ Answer Questions from Millions﻿﻿﻿‌ of Narrated Videos

7.1.8 Pinocchio

7.1.9 ProxSuite﻿‌​‌

7.1.10 SPE

7.1.11 TubeDETR​​﻿﻿

7.1.12 vil3dref

7.1.13 VLN-DUET

7.2﻿​﻿﻿ New platforms

7.3 Open data

8﻿​​﻿ New results

8.1 Visual​​​‌ recognition and reconstruction of﻿﻿﻿‌ images and videos

8.1.1﻿‌​‌ MaskCaptioner: Learning to Jointly﻿​​﻿ Segment and Caption Object​​​‌ Trajectories in Videos

8.1.2﻿​​﻿ ComposeAnything: Composite Object Priors​​​‌ for Text-to-Image Generation

8.1.3 FACap: A​​​‌ Large-Scale Fashion Dataset for﻿﻿﻿‌ Fine-grained Composed Image Retrieval﻿‌​‌

8.1.4​​​‌ Chapter-Llama: Efficient Chaptering in﻿​﻿﻿ Hour-Long Videos with LLMs​‌﻿﻿

8.1.5 Online 3D Scene​‌﻿﻿ Reconstruction Using Neural Object​​﻿﻿ Priors

8.1.6 Detecting Looted﻿​​﻿ Archaeological Sites from Satellite​​​‌ Image Time Series

8.1.7 Towards Zero-Shot Multimodal​​​‌ Machine Translation

8.1.8 mOSCAR: A​​​‌ Large-scale Multilingual and Multimodal﻿​﻿﻿ Document-level Corpus

8.1.9 Dual Perspectives on﻿﻿﻿‌ Non-Contrastive Self-Supervised Learning

8.1.10 Optimal transport﻿​​﻿ unlocks end-to-end learning for​​​‌ single-molecule localization

8.2﻿﻿﻿‌ Learning embodied representations

8.2.1﻿‌​‌ NextBestPath: Efficient 3D Mapping﻿​​﻿ of Unseen Environments

8.2.2 FOM-Nav: Frontier-Object Maps​‌﻿﻿ for Object Goal Navigation​​﻿﻿

8.2.3 Gondola: Grounded Vision​​​‌ Language Planning for Generalizable﻿​﻿﻿ Robotic Manipulation

8.2.4 Collision avoidance from﻿‌​‌ monocular vision trained with﻿​​﻿ novel view synthesis

8.2.5﻿‌​‌ KernelSOS for Global Sampling-Based﻿​​﻿ Optimal Control and Estimation​​​‌ via Semidefinite Programming

8.2.6 Sobolev Diffusion Policy﻿​﻿﻿

8.2.7 First-order​​﻿﻿ Sobolev Reinforcement Learning

8.2.8﻿‌​‌ Control of Humanoid Robots﻿​​﻿ with Parallel Mechanisms using​​​‌ Differential Actuation Models

8.2.9 On the Conic﻿‌​‌ Complementarity of Planar Contacts﻿​​﻿

8.2.10 Reference-Free Sampling-Based​‌﻿﻿ Model Predictive Control

8.2.11​‌﻿﻿ Guided Flow Policy: Learning​​﻿﻿ from High-Value Actions in​​​‌ Offline Reinforcement Learning

8.2.12 Contact-Implicit﻿‌​‌ Inverse Dynamics

8.3 A Data-driven Contact​​​‌ Estimation Method for Wheeled-Biped﻿​﻿﻿ Robots

8.3.1 End-to-End﻿​﻿﻿ and Highly-Efficient Differentiable Simulation​‌﻿﻿ for Robotics

8.3.2﻿​​﻿ Differentiable Simulation of Soft​​​‌ Robots with Frictional Contacts﻿﻿﻿‌

8.3.3 Constrained﻿‌​‌ Articulated Body Algorithms for﻿​​﻿ Closed-Loop Mechanisms

8.3.4 A Data-driven​​​‌ Contact Estimation Method for﻿﻿﻿‌ Wheeled-Biped Robots

8.3.5​​﻿﻿ Guardian: Detecting Robotic Planning​​​‌ and Execution Errors with﻿​﻿﻿ Vision-Language Models

8.3.6﻿​​﻿ Augmented Lagrangian methods for​​​‌ infeasible convex optimization problems﻿﻿﻿‌ and diverging proximal-point algorithms﻿‌​‌

8.3.7 Certifiably optimal rotation​​​‌ and pose estimation based﻿﻿﻿‌ on the Cayley map﻿‌​‌

8.3.8 Human-Robot﻿‌​‌ Co-Simulation method for upper﻿​​﻿ limb assistive force calculation​​​‌ using polytopes

8.3.9﻿​﻿﻿ PROXQP: an Efficient and​‌﻿﻿ Versatile Quadratic Programming Solver​​﻿﻿ for Real-Time Robotics Applications​​​‌ and Beyond

2025Activity‌ reportProject-TeamWILLOW

Computer Science and Digital Science

Other‌ Research Topics and Application‌‌ Domains

1‌ Team members, visitors, external‌ collaborators

Post-Doctoral Fellows‌

Interns and‌ Apprentices

External Collaborators

2 Overall objectives‌

3 Research program

3.1 Visual recognition‌ and reconstruction of images and videos

3.2 Learning embodied representations‌

3.3 Image restoration‌ and enhancement

4 Application‌ domains

4.1‌‌ Automated visual assistants

4.2 Robotics‌‌

4.3 Image restoration‌

5 Social and environmental responsibility‌

6 Highlights of the year‌

7‌ Latest software developments, platforms,‌ open data

7.1 Latest‌‌ software developments

7.1.1 alignsdf

7.1.3‌‌ BurstSR

7.1.4‌ FrozenBiLM

7.1.5‌ hiveformer

7.1.6‌ HM3DAutoVLN

7.1.7 Just Ask: Learning to‌ Answer Questions from Millions‌ of Narrated Videos

7.1.9 ProxSuite‌‌

7.1.11 TubeDETR

7.2 New platforms

8 New results

8.1 Visual recognition and reconstruction of images and videos

8.1.1 MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

8.1.2 ComposeAnything: Composite Object Priors for Text-to-Image Generation

8.1.3 FACap: A Large-Scale Fashion Dataset for Fine-grained Composed Image Retrieval

8.1.4 Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

8.1.5 Online 3D Scene Reconstruction Using Neural Object Priors

8.1.6 Detecting Looted Archaeological Sites from Satellite Image Time Series

8.1.7 Towards Zero-Shot Multimodal Machine Translation

8.1.8 mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

8.1.9 Dual Perspectives on Non-Contrastive Self-Supervised Learning

8.1.10 Optimal transport unlocks end-to-end learning for single-molecule localization

8.2 Learning embodied representations

8.2.1 NextBestPath: Efficient 3D Mapping of Unseen Environments

8.2.2 FOM-Nav: Frontier-Object Maps for Object Goal Navigation

8.2.3 Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

8.2.4 Collision avoidance from monocular vision trained with novel view synthesis

8.2.5 KernelSOS for Global Sampling-Based Optimal Control and Estimation via Semidefinite Programming

8.2.6 Sobolev Diffusion Policy

8.2.7 First-order Sobolev Reinforcement Learning

8.2.8 Control of Humanoid Robots with Parallel Mechanisms using Differential Actuation Models

8.2.9 On the Conic Complementarity of Planar Contacts

8.2.10 Reference-Free Sampling-Based Model Predictive Control

8.2.11 Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

8.2.12 Contact-Implicit Inverse Dynamics

8.3 A Data-driven Contact Estimation Method for Wheeled-Biped Robots

8.3.1 End-to-End and Highly-Efficient Differentiable Simulation for Robotics

8.3.2 Differentiable Simulation of Soft Robots with Frictional Contacts

8.3.3 Constrained Articulated Body Algorithms for Closed-Loop Mechanisms

8.3.4 A Data-driven Contact Estimation Method for Wheeled-Biped Robots

8.3.5 Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models

8.3.6 Augmented Lagrangian methods for infeasible convex optimization problems and diverging proximal-point algorithms

8.3.7 Certifiably optimal rotation and pose estimation based on the Cayley map

8.3.8 Human-Robot Co-Simulation method for upper limb assistive force calculation using polytopes

8.3.9 PROXQP: an Efficient and Versatile Quadratic Programming Solver for Real-Time Robotics Applications and Beyond

8.3.10 Optimal Control of Walkers with Parallel Actuation

8.3.11 Extended URDF: Accounting for parallel mechanism in robot description

8.3.12 PROXDDP: Proximal Constrained Trajectory Optimization

8.3.13 Structure-Exploiting Sequential Quadratic Programming for Model-Predictive Control

8.3.14 Modeling, Embedded Control and Design of Soft Robots using a Learned Condensed FEM Model

8.3.15 Infinite-Horizon Value Function Approximation for Model Predictive Control

8.4 Image restoration and enhancement

8.4.1 A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations

8.4.2 Deep learning for exoplanet detection and characterization by direct imaging at high contrast

8.4.3 Joint statistical modeling and deep learning for exoplanet detection and characterization by direct imaging at high contrast

8.4.4 Modèle statistique apprenable de mélange de distributions et fusion de données multivariées pour l'imagerie d'exoplanètes

8.4.5 CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis

8.5 Doctoral dissertations and habilitation theses

8.5.1 Deep learning for exoplanet detection in high contrast imaging

8.5.2 Learning dexterous manipulation from 3D hand and object interaction

8.5.3 Analysis of satellite image time series for classification and change detection