2025 Activity report
Project-Team WILLOW
RNSR: 200718311C
- Research center: Inria Paris Centre
- In partnership with: Ecole normale supérieure de Paris, CNRS
- Team name: Embodied computer vision
- In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure
Creation of the Project-Team: 2021 May 01
Keywords
Computer Science and Digital Science
- A3.1.1. Modeling, representation
- A3.4. Machine learning and statistics
- A5.3. Image processing and analysis
- A5.10. Robotics
- A9. Artificial intelligence
- A9.1. Knowledge
- A9.2. Machine learning
- A9.3. Signal processing
- A9.5. Robotics and AI
- A9.7. AI algorithmics
- A9.12. Computer vision
Other Research Topics and Application Domains
- B9.5.1. Computer science
- B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
- Justin Carpentier [Team leader, INRIA, Researcher]
- Stephane Caron [INRIA, Associate Professor Detachement, HDR]
- Shizhe Chen [INRIA, Researcher]
- Yann Dubois De Mont-Marin [INRIA, Starting Research Position, from May 2025]
- Frederike Dumbgen [INRIA, Starting Research Position, until Apr 2025]
- Wilson Jallet [INRIA, Starting Research Position, from May 2025]
- Louis Montaut [INRIA, Starting Research Position, from May 2025]
- Jean Ponce [ENS PARIS, Senior Researcher, HDR]
- Cordelia Schmid [INRIA, Senior Researcher, HDR]
Post-Doctoral Fellows
- Ewen Dantec [ENS Paris, Post-Doctoral Fellow, until Nov 2025]
- Yann Dubois De Mont-Marin [INRIA, Post-Doctoral Fellow, until Apr 2025]
- Frederike Dumbgen [INRIA, Post-Doctoral Fellow, from May 2025]
- Wilson Jallet [INRIA, Post-Doctoral Fellow, until Apr 2025]
- Quentin Le Lidec [INRIA, Post-Doctoral Fellow, until May 2025]
- Etienne Menager [INRIA, Post-Doctoral Fellow]
- Louis Montaut [INRIA, Post-Doctoral Fellow, until Apr 2025]
- Etienne Moullet [INRIA, Post-Doctoral Fellow, until Sep 2025]
- Ajay Sathya [INRIA, Post-Doctoral Fellow]
PhD Students
- Roland Andrews [INRIA]
- Theo Bodrito [INRIA, until Apr 2025]
- Thomas Chabal [INRIA, until Aug 2025]
- Zerui Chen [INRIA, until Apr 2025]
- Ludovic De Matteis [UNIV TOULOUSE III, until Apr 2025]
- Gabriel Fiastre [INRIA]
- Matthieu Futeral-Peter [INRIA, until Jun 2025]
- Ricardo Garcia Pinel [INRIA, until Jun 2025]
- Francois Garderes [LOUIS VUITTON, CIFRE]
- Umit Bora Gokbakan [INRIA]
- Zeeshan Khan [INRIA]
- Theotime Le Hellard [INRIA, from Oct 2025]
- Shiyao Li [ENPC]
- Javier Alejandro Lopetegui Gonzalez [INRIA, from Nov 2025]
- Imen Mahdi [University of Freiburg]
- Franki Nguimatsia Tiofack [INRIA]
- Paul Pacaud [INRIA]
- Sara Pieri [INRIA]
- François Porcher [Meta, CIFRE]
- Mathis Scheffler [INRIA, from Oct 2025]
- Fabian Schramm [INRIA]
- Romain Seailles [ENS Paris]
- Federica Spinola [INRIA, from Oct 2025]
- Basile Terver [FACEBOOK, CIFRE]
- Valentin Tordjman–Levavasseur [INRIA, from Oct 2025]
- Lucas Ventura [ENPC]
- Elliot Vincent [Ministère Transition, until Jun 2025]
Technical Staff
- Etienne Arlaud [INRIA, Engineer]
- Walid Bousselham [INRIA, Engineer, from May 2025 until Jun 2025]
- Riccardo Cadei [INRIA, Engineer, from Mar 2025 until Jul 2025]
- Timothee Carecchio [INRIA, Engineer, from Sep 2025]
- Aamr El Kazdadi [INRIA, Engineer, from Oct 2025]
- Pierre Fabre [INRIA, Engineer, from Jun 2025]
- Lucas Haubert [INRIA, Engineer]
- Qikai Huang [UNIV GEORGIA, Engineer, from Dec 2025]
- Peteris Kulits [INRIA, Engineer, from Jun 2025 until Nov 2025]
- Louise Manson [INRIA, Engineer]
- Jeanne Matheron [INRIA, from Nov 2025]
- Megane Millan [INRIA, Engineer, until Oct 2025]
- Louis Nel [INRIA, Engineer, from Dec 2025]
- Valentin Tordjman–Levavasseur [INRIA, until Sep 2025]
- Joris Vaillant [INRIA, Engineer]
Interns and Apprentices
- Joaquin Austin Ferro [INRIA, Apprentice, from Sep 2025]
- Timothee Carecchio [INRIA, Intern, from Feb 2025 until Aug 2025]
- Radu Cristian [INRIA, Intern, from Mar 2025 until Aug 2025]
- Theotime Le Hellard [INRIA, from Sep 2025 until Sep 2025]
- Theotime Le Hellard [ENS Paris, Intern, from Apr 2025 until Aug 2025]
- Romain Li [INRIA, Intern, from Jul 2025 until Sep 2025]
- Armand Modjabi [INRIA, Intern, from Mar 2025 until Aug 2025]
- Louis Nel [INRIA, Intern, from Jun 2025 until Nov 2025]
- Matthieu Rouet [ENS Paris, Intern, from Jun 2025 until Aug 2025]
- Mathis Scheffler [INRIA, Intern, from Apr 2025 until Sep 2025]
Administrative Assistant
- Marina Kovacic [INRIA]
Visiting Scientists
- Qikai Huang [UNIV GEORGIA, until Nov 2025]
- Kim Jae Myung [UNIV TUBINGEN, until May 2025]
- Victor Klemm [INSTITUT ETH, from May 2025 until Jun 2025]
- Michael Tarr [CMU, from Sep 2025 until Oct 2025]
- Baohe Zhang [University of Freiburg, from Aug 2025 until Nov 2025]
External Collaborators
- Antoine Hoarau [Self-employed, from Oct 2025]
- Theotime Le Hellard [ENS de Paris, until Apr 2025]
2 Overall objectives
2.1 Statement
Building machines that can automatically understand complex visual inputs is one of the central scientific challenges in artificial intelligence. Truly successful visual understanding technology will have a broad impact in application domains as varied as defense, entertainment, healthcare, human-computer interaction, image retrieval and data mining, industrial and personal robotics, manufacturing, scientific image analysis, surveillance and security, and transportation.
The problem is, however, very difficult due to the large variability of the visual world and the high complexity of the underlying physical phenomena. For example, people easily learn how to perform complex tasks such as changing a car tire or performing resuscitation by observing other people. This involves advanced visual perception and interaction capabilities including interpreting sequences of human actions, learning new visuomotor skills from only a few example demonstrations, grounding instructions in appropriate scene elements and actions, and applying the learned skills in new environments and situations. Currently, however, there is no artificial system with a similar level of cognitive visual competence. Our goal for the next 10 years is to develop models, methods and algorithms providing a sufficient level of visual intelligence to enable applications such as personal visual assistants or home robots that will, for example, prepare a meal in response to a chat request.
Despite the tremendous progress in visual recognition in the last decade, current visual recognition systems still require large amounts of carefully annotated training data, often use black-box architectures that do not model the 3D physical nature of the visual world, are typically limited to simple pattern recognition tasks such as detecting and recognizing objects from a predefined vocabulary, and do not capture real-world semantics. We plan to address these limitations with an ambitious research program that aims at developing models of the entire visual understanding process from image acquisition to the high-level embodied interpretation of visual scenes. We target learnable models that require minimal to no supervision, support complex reasoning about visual data, and are grounded in interactions with the physical world. More concretely, we will address fundamental scientific challenges along three research axes: (i) visual recognition in images and videos with an emphasis on weakly supervised learning and models grounded in the physical 3D world; (ii) learning embodied visual representations for robotic manipulation and locomotion; and (iii) image restoration and enhancement. These challenges will be tackled by a team of researchers with core expertise in computer vision and robotics, who will simultaneously advance both fields towards convergence. The complementary expertise in areas such as machine learning and natural language understanding will be gained through collaboration with relevant research teams.
We believe that foundational research should be grounded in applications and we plan to pursue applications with high scientific, societal, and/or economic impact in domains such as transportation; augmented reality; education; advanced manufacturing; and quantitative visual analysis in sciences, humanities and healthcare.
3 Research program
3.1 Visual recognition and reconstruction of images and videos
It is now possible to efficiently detect individual objects and people in cluttered images and videos. Current methods, however, rely on large-scale, manually-annotated image collections, often use black-box architectures that do not model the 3D physical nature of the visual world, and are typically limited to simple pattern recognition tasks. In this part of the research program, we address these fundamental limitations. In particular, we address the following three key open challenges: (i) how to leverage available but weak annotations including text, audio and speech, (ii) how to enable automatic reasoning about visual data, and (iii) how to develop models grounded in the physical 3D world including learnable models for 3D object and scene reconstruction. We also continue theoretical work aimed at understanding the geometric underpinnings of computer vision.
Our current efforts in this area are outlined in detail in Section 8.1.
3.2 Learning embodied representations
Computer vision has come a long way toward understanding images and videos in terms of scene geometry, object labels, locations and poses of people or classes of human actions. This “understanding”, however, remains largely disconnected from reasoning about the physical world. For example, what will happen when removing a tablecloth from a set table? What actions will be needed to resume an interrupted meal? We believe that a true embodied understanding of dynamic scenes from visual observations is the next major research challenge. We address this challenge by developing new models and algorithms with an emphasis on the synergy between vision, learning, robotics and natural language understanding. To this end, we study learning methods for motion planning and optimal control for known environments in state space. At the same time, we develop models and algorithms for learning visuomotor policies that do not rely on the known structure of environments and instead integrate visual perception directly into control algorithms. We also study natural language as an additional modality for more efficient learning and communication with embodied agents.
Our current efforts in this area are outlined in detail in Section 8.2.
3.3 Image restoration and enhancement
Although image processing is a mature field, it is more important than ever with the advent of high-quality camera phones, scientific applications in microscopy and astronomy and, recently, the emergence of multi-modal sensing, for example for autonomous cars. In addition, it is an excellent proving ground for learning-based techniques since (a) it is in general (relatively) easy to generate realistic corrupted images from clean ones, since reasonable models of the physical image corruption problem are often available (Abdelhamed et al., 2019; Nah et al., 2017), and (b) it is possible to incorporate natural image priors such as self-similarities (Buades et al., 2005) and sparsity (Mairal et al., 2009) in the modelling and optimization processes. We have conducted work on image restoration since the time of Julien Mairal's PhD thesis, addressing problems such as demosaicking, denoising, inpainting, and inverse half-toning with a combination of sparse coding/dictionary learning methods and non-local means, then moving on to blind deblurring including motion segmentation and, more recently, deep-learning methods. In our on-going efforts we address several challenges for learning-based approaches to image restoration: (i) how to combine different modalities such as depth and RGB images to improve the quality of the joint observations; (ii) how to construct tunable, fully interpretable approaches to image restoration in a functional framework; and (iii) how to incorporate machine learning methods that go beyond the traditional fully supervised setting into the image restoration pipeline.
Our current work in this area is outlined in detail in Section 8.4.
4 Application domains
We believe that foundational modeling work should be grounded in applications. This includes (but is not restricted to) the following high-impact domains.
4.1 Automated visual assistants
Modern seamless video communication has enabled new applications in education, medicine and manufacturing, such as remote surgery or remotely-supervised product assembly. The abundance of online instructional videos further confirms the high demand for assistance, including for daily tasks such as cooking and gardening. Our work on embodied video understanding and on the joint modeling of vision and language will support automatic visual assistants. Similar to existing driving navigation assistants, such applications will guide people in daily living, inspection and manufacturing tasks. Some of these applications are studied within our MSR-Inria collaboration.
4.2 Robotics
In 2023, the Willow team pursued the development of the Pinocchio library from both a scientific and a software perspective. Recent versions of Pinocchio now account for closed-loop mechanisms (based on proximal optimization), source code generation on GPUs, etc. All these new features make Pinocchio a unique tool to efficiently control complex robotic systems such as legged robots or industrial robots. We are now closely collaborating with PAL Robotics, which plans to use Pinocchio to control its next generation of humanoid robots, called Kangaroo. In the near future, the plan is to extend Pinocchio into a general-purpose and efficient robotic simulator, simulating both rigid and compliant contact interactions between a robot and its environment, with the ambition of making Pinocchio the next gold-standard framework for simulation in robotics, offering advanced features for optimal control and reinforcement learning, such as differentiable simulation. Such features should position Pinocchio as the central simulator in robotics.
4.3 Image restoration
We are pursuing applications of our image restoration work to personal photography, to enhance the images taken by consumer cameras and smartphones by deblurring and denoising them, and improving their spatial resolution and dynamic range. In this context, we are collaborating with DXOMark, the world leader in smartphone camera evaluation, through a CIFRE thesis. Two of the objectives are to develop a public database of portraits fully compliant with European GDPR regulations, with informed consent from the models, and to automate the rating of image quality using this dataset. We also apply the mixture of physical image formation models and machine learning principles that has made our image restoration work successful to scientific fields: we collaborate with Anne-Marie Lagrange (Observatoire de Paris), Maud Langlois (CNRS/Observatoire de Lyon) and Julien Mairal (Inria) on direct exoplanet detection from ground-based telescope imagery. This work also involves a post-doc, Olivier Flasseur, and a PhD student, Théo Bodrito. Next year, we will apply the same principles to molecular microscopy, in collaboration with Jean-Baptiste Masson (Institut Pasteur).
5 Social and environmental responsibility
Artificial intelligence holds great potential for improving our environment, for example, by reducing energy consumption and optimizing energy production. Computer vision, in particular, can be used to monitor emissions from coal plants and to track forest growth using satellite imagery. Autonomous drones can monitor and prevent failures of pipelines, power lines, power plants and other remote installations. However, as larger and more powerful AI models require increased compute power at training and deployment, AI itself accounts for an increasingly high carbon footprint. One direction of our research aims to develop efficient and low-resource neural network models. To this end, we have previously proposed Cross-Covariance Image Transformers (El-Nouby et al. NeurIPS'21) that avoid quadratic complexity in terms of image size. We have also been working on the development of new optimization methods and associated software (Bambade et al. ICLR'24) to reduce the overall computational burden and the energy impact of these methods when applied to industrial and practical scenarios. In light of these developments, with the help of the Inria Soft infrastructure, we are considering creating a new software consortium, named Maestro, to accelerate the development and the dissemination of efficient algorithmic solutions for the control of robotic systems. One objective of this consortium is to provide software solutions that reduce the computational burden and energy consumption of modern robots currently deployed in industry or in societal sectors.
6 Highlights of the year
6.1 Awards
- Cordelia Schmid has received the Archimedes Science Award, Dresden, 2025.
- Cordelia Schmid has received the Hans Fischer Senior Fellowship, TUM, 2025.
- Cordelia Schmid has received the ACM Athena Lecturer Award, 2025.
- Cordelia Schmid has been named a Member of the National Academy of Artificial Intelligence (NAAI), 2025.
- Justin Carpentier and Cordelia Schmid have received the Prix de thèse du GdR Robotique 2025 for Q. Le Lidec.
- Ajay Sathya has received the IEEE Robotics and Automation Letters Best Paper Award.
- Quentin Le Lidec has been awarded the Best PhD Thesis Award in robotics by the French national robotics network.
- Antoine Bambade has received the Prix Paul Caseau, awarded by the French Academy of Technologies and EDF.
- Stéphane Caron has been awarded a PIQ grant, entitled OSS4EAI.
- Jean Ponce, together with Julien Mairal, has been awarded an i-Lab award for their startup Enhance Lab.
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 alignsdf
- Keywords: Computer vision, 3D reconstruction
- Functional Description: This is the PyTorch implementation of the AlignSDF research paper: "AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction", Zerui Chen, Yana Hasson, Ivan Laptev, Cordelia Schmid, ECCV 2022.
- Contact: Zerui Chen
- Participant: 4 anonymous participants
7.1.2 BLERC
- Name: Benchmarking Learning Efficiency in Deep Reservoir Computing
- Keywords: Machine learning, Continual Learning
- Functional Description: Measuring learning efficiency of machine learning models.
- Contact: Hugo Cisneros
7.1.3 BurstSR
- Name: Super-resolution from image bursts
- Keyword: Image processing
- Functional Description: This is a research prototype that takes as input a sequence of raw or RGB images produced by a smartphone or digital camera and produces a high-quality color image with higher resolution.
- Release Contributions: This new version, v0.2, introduces various improvements, as well as C++ code that accelerates the original Python code.
- Contact: Julien Mairal
- Participant: 3 anonymous participants
7.1.4 FrozenBiLM
- Name: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
- Keywords: Computer vision, Natural language processing, Deep learning
- Functional Description: Code, datasets and models associated with the paper "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models".
- Contact: Antoine Yang
7.1.5 hiveformer
- Keywords: Robotics, NLP, Transformer
- Functional Description: This is the PyTorch implementation of the Hiveformer research paper: "Instruction-driven history-aware policies for robotic manipulations", Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid, CoRL 2022 (oral).
- Contact: Pierre-Louis Guhur
- Participant: 6 anonymous participants
7.1.6 HM3DAutoVLN
- Name: Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
- Keyword: Computer vision
- Functional Description: Open source release of the software package for the ECCV'22 paper by Chen et al. "Learning from Unlabeled 3D Environments for Vision-and-Language Navigation". This release provides a full implementation of the method, including code for training models and testing on standard datasets, generated datasets, as well as trained models.
- Contact: Shizhe Chen
- Participant: 5 anonymous participants
7.1.7 Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- Keywords: Computer vision, Natural language processing, Deep learning
- Functional Description: Code, datasets and models associated with the paper "Just Ask: Learning to Answer Questions from Millions of Narrated Videos".
- Contact: Antoine Yang
7.1.8 Pinocchio
- Name: Pinocchio
- Keywords: Robotics, Biomechanics, Mechanical multi-body systems
- Functional Description: Pinocchio instantiates state-of-the-art rigid-body algorithms for poly-articulated systems, based on revisited versions of Roy Featherstone's algorithms. In addition, Pinocchio provides analytical derivatives of the main rigid-body algorithms, such as the Recursive Newton-Euler Algorithm and the Articulated-Body Algorithm. Pinocchio is primarily tailored for legged robotics applications, but it can be used in other contexts. It is built upon Eigen for linear algebra and FCL for collision detection. Pinocchio comes with a Python interface for fast code prototyping.
- Contact: Justin Carpentier
- Partner: CNRS
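To make the Recursive Newton-Euler Algorithm mentioned in the description concrete, here is a minimal sketch for a planar chain of revolute joints with a point mass at the tip of each link. It is an illustrative simplification and not Pinocchio's implementation or API: the function name `planar_rnea`, the point-mass link model and the world-frame 2D formulation are assumptions made for this example, whereas Pinocchio handles general 3D multi-body models.

```python
import math


def planar_rnea(q, qd, qdd, lengths, masses, gravity=9.81):
    """Joint torques for a planar revolute chain (hypothetical toy model).

    Each link i has length lengths[i] and a point mass masses[i] at its tip.
    Angles are measured from the horizontal; gravity acts along -y.
    """
    n = len(q)

    # Forward pass: propagate angles, velocities and accelerations
    # from the fixed base to the last link.
    theta = omega = alpha = 0.0
    ax = ay = 0.0          # linear acceleration of the current joint origin
    arms = []              # vector from joint i to the tip of link i
    tip_acc = []           # linear acceleration of each tip mass
    for i in range(n):
        theta += q[i]
        omega += qd[i]
        alpha += qdd[i]
        rx = lengths[i] * math.cos(theta)
        ry = lengths[i] * math.sin(theta)
        # a_tip = a_joint + alpha * (k x r) - omega^2 * r
        ax += -alpha * ry - omega ** 2 * rx
        ay += alpha * rx - omega ** 2 * ry
        arms.append((rx, ry))
        tip_acc.append((ax, ay))

    # Backward pass: accumulate forces and torques from the last link
    # back to the base.
    tau = [0.0] * n
    fx = fy = 0.0
    torque = 0.0
    for i in reversed(range(n)):
        fx += masses[i] * tip_acc[i][0]
        fy += masses[i] * (tip_acc[i][1] + gravity)
        rx, ry = arms[i]
        torque += rx * fy - ry * fx   # planar cross product r x f
        tau[i] = torque
    return tau


# Static torques needed to hold a two-link arm horizontally.
print(planar_rnea([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.5], [1.0, 2.0]))
```

For a single horizontal link at rest, the sketch reduces to the familiar holding torque m g l, which is a quick sanity check on the two passes.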
7.1.9 ProxSuite
- Name: ProxSuite
- Keywords: Conic optimization, Linear optimization, Robotics
- Functional Description: ProxSuite is a collection of open-source, numerically robust, precise and efficient numerical solvers (e.g., LPs, QPs, etc.) rooted in revisited primal-dual proximal algorithms. Through ProxSuite, we aim to offer the community scalable optimizers that can deal with dense, sparse or matrix-free problems. While the first targeted application is robotics, ProxSuite can be used in other contexts without limits. ProxSuite is actively developed and supported by the Willow and Sierra research groups, joint research teams between Inria, École Normale Supérieure de Paris and Centre National de la Recherche Scientifique, located in France.
- Contact: Justin Carpentier
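As an illustration of what a primal-dual proximal algorithm for a QP can look like, the sketch below applies a proximal method of multipliers to a small equality-constrained QP, min ½ xᵀHx + gᵀx subject to Ax = b. It is a textbook-style toy under stated assumptions, not ProxSuite's solver or API: the names `solve` and `prox_al_qp`, the parameters `rho` and `mu`, and the restriction to dense equality constraints are all choices made for this example.

```python
def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = s / aug[r][r]
    return z


def prox_al_qp(H, g, A, b, rho=1e-6, mu=1e-2, iters=50):
    """Proximal method of multipliers for: min 0.5 x'Hx + g'x  s.t.  Ax = b.

    rho is the primal proximal weight, mu the dual (penalty) step parameter.
    """
    n, m = len(g), len(b)
    x, y = [0.0] * n, [0.0] * m
    # The regularized matrix H + rho*I + (1/mu) A'A is constant across iterations.
    K = [[H[i][j] + (rho if i == j else 0.0)
          + sum(A[k][i] * A[k][j] for k in range(m)) / mu
          for j in range(n)] for i in range(n)]
    for _ in range(iters):
        # Primal step: minimize the proximal augmented Lagrangian in x.
        rhs = [-g[i] + rho * x[i]
               - sum(A[k][i] * (y[k] - b[k] / mu) for k in range(m))
               for i in range(n)]
        x = solve(K, rhs)
        # Dual step: multiplier update using the constraint residual.
        residual = [sum(A[k][j] * x[j] for j in range(n)) - b[k] for k in range(m)]
        y = [y[k] + residual[k] / mu for k in range(m)]
    return x, y


# Closest point to the origin on the line x0 + x1 = 1.
x, y = prox_al_qp([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [1.0])
print(x, y)
```

The proximal terms keep the linear system well posed even when H is only positive semidefinite, which is one reason this family of methods is regarded as numerically robust.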
7.1.10 SPE
- Name: Semantics Preserving Encoder
- Keywords: NLP, Adversarial attack, Word embeddings
- Functional Description: Semantics Preserving Encoder is a simple, fully supervised sentence embedding technique for textual adversarial attacks.
- Contact: Hugo Cisneros
- Participant: 3 anonymous participants
7.1.11 TubeDETR
- Name: TubeDETR: Spatio-Temporal Video Grounding with Transformers
- Keywords: Computer vision, Natural language processing, Deep learning
- Functional Description: Code, datasets and models associated with the paper "TubeDETR: Spatio-Temporal Video Grounding with Transformers".
- Contact: Antoine Yang
7.1.12 vil3dref
- Name: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
- Keyword: Computer vision
- Functional Description: Open source release of the software package for the NeurIPS'22 paper by Chen et al. "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding". This release provides a full implementation of the method, including code for training models and testing on standard datasets, as well as trained models.
- Contact: Shizhe Chen
- Participant: 5 anonymous participants
7.1.13 VLN-DUET
- Name: Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- Keyword: Computer vision
- Functional Description: Open source release of the software package for the CVPR'22 paper by Chen et al. "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation". This release provides a full implementation of the method, including code for training models and testing on standard datasets, as well as trained models.
- Contact: Shizhe Chen
- Participant: 5 anonymous participants
Participants: Jean Ponce, Justin Carpentier, Cordelia Schmid, Ivan Laptev, Etienne Arlaud, Pierre-Guillaume Raverdy, Stephane Caron, Shizhe Chen.
7.2 New platforms
Together with SED, we are building the new robotics laboratory at Inria Paris, located on the 1st floor of the A building. The lab hosts a diverse set of robotic platforms covering dexterous manipulation, legged locomotion, and mobile robotics. The current equipment includes three UR5 robotic arms, an Allegro Hand, a Shadow Hand, and a TIAGo++ robot integrating both a mobile base and a manipulator. For legged and mobile experiments, the lab includes the Upkie biped, the Unitree GO2 quadruped, and the ODRI Solo-12 quadruped. In 2025, the laboratory expanded its fleet with the acquisition of two Unitree G1 humanoid robots. The robotics laboratory is also equipped with a dedicated motion capture system for precise object localization and robot calibration. These robotic platforms will enable our future research and experiments on locomotion, navigation and manipulation.
7.3 Open data
8 New results
8.1 Visual recognition and reconstruction of images and videos
8.1.1 MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Participants: Gabriel Fiastre, Antoine Yang, Cordelia Schmid.
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose in this work [44] to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are publicly available.
Example of synthetic captions in our LV-VISCap dataset.
8.1.2 ComposeAnything: Composite Object Priors for Text-to-Image Generation
Participants: Zeeshan Khan, Shizhe Chen, Cordelia Schmid.
This paper [47] addresses the problem of compositional text-to-image generation. Current text-to-image models struggle to generate scenes with many objects and complex relations. Training-time solutions such as layout conditioning or reinforcement learning improve compositional accuracy but often degrade image quality and realism by enforcing rigid constraints. To address this limitation, we introduce ComposeAnything, an inference-only framework that injects a structured composite object prior directly into the diffusion process. Rather than starting from random latent noise or performing expensive noise optimization, we construct a single 2.5D composite prior encoding strong object appearance, counts, sizes, and coarse depth-aware placement, and use it to initialize and guide one diffusion trajectory. This explicit prior is interpretable and editable in image space, enabling human-in-the-loop refinement by simply adjusting the composite. Our training-free, backbone-agnostic method improves compositional consistency on the T2I-CompBench and NSR-1K benchmarks, particularly for complex prompts, while maintaining high visual quality compared to both training-based baselines and other inference-time methods.
ComposeAnything enables text-to-image generation for complex compositions involving surreal spatial relationships and high object counts, achieving both high visual quality and strong faithfulness to text.
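The idea of initializing a diffusion trajectory from a prior rather than from pure noise can be sketched with the standard DDPM forward-noising formula, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. The snippet below is only a generic illustration under assumptions: the linear beta schedule, the function names `alpha_bar` and `noised_prior`, and the use of a flat list of floats as a stand-in for a latent are all hypothetical and are not taken from the ComposeAnything paper.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s) for s < t, linear beta schedule (assumed)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noised_prior(prior, t, rng):
    """Forward-diffuse a prior to step t so that denoising can start from it."""
    ab = alpha_bar(t)
    return [math.sqrt(ab) * p + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for p in prior]


rng = random.Random(0)
composite = [0.2, -0.5, 1.0]              # stand-in for a composite prior latent
start = noised_prior(composite, 600, rng)  # intermediate step keeps part of the prior
print(start)
```

A smaller `t` preserves more of the prior's structure, while a larger `t` leaves the sampler more freedom; how the actual method balances the two is not specified here.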
8.1.3 FACap: A Large-Scale Fashion Dataset for Fine-grained Composed Image Retrieval
Participants: François Gardères, Camille-Sovanneary Gauthier, Shizhe Chen, Jean Ponce.
The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce in this paper [33] FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model's performance in fashion CIR especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites.
Our automatically constructed FACap dataset offers more detailed and accurate annotations than existing datasets for the fashion CIR task.
8.1.4 Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Participants: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol.
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper [31], we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.
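The text-domain input described above, speech transcripts and frame captions with their timestamps, can be pictured as a single time-ordered text stream. The following sketch shows one possible way to build such a stream; the helper names `format_timestamp` and `build_prompt`, the `ASR` and `Caption` tags and the line layout are assumptions for illustration and may differ from the format actually used by Chapter-Llama.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_prompt(transcripts, captions):
    """Interleave (time_in_seconds, text) pairs from both sources by time."""
    events = [(t, "ASR", text) for t, text in transcripts]
    events += [(t, "Caption", text) for t, text in captions]
    events.sort(key=lambda event: event[0])  # stable sort keeps ties in input order
    return "\n".join(
        f"{format_timestamp(t)} {kind}: {text}" for t, kind, text in events
    )


prompt = build_prompt(
    transcripts=[(12.0, "welcome back to the channel"), (95.5, "first, chop the onions")],
    captions=[(5.0, "a person stands in a kitchen"), (90.0, "a cutting board with onions")],
)
print(prompt)
```

Keeping everything as timestamped text is what lets a language model predict chapter boundaries directly as timestamps in its output.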
8.1.5 Online 3D Scene Reconstruction Using Neural Object Priors
Participants: Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid.
This paper 23 addresses the problem of reconstructing a scene online at the level of objects given an RGB-D video sequence. While current object-aware neural implicit representations hold promise, they are limited in online reconstruction efficiency and shape completion. Our main contributions to alleviate the above limitations are twofold. First, we propose a feature grid interpolation mechanism to continuously update grid-based object-centric neural implicit representations as new object parts are revealed. Second, we construct an object library with previously mapped objects in advance and leverage the corresponding shape priors to initialize geometric object models in new videos, subsequently completing them with novel views as well as synthesized past views to avoid losing original object details. Extensive experiments on synthetic environments from the Replica dataset, real-world ScanNet sequences and videos captured in our laboratory demonstrate that our approach outperforms state-of-the-art neural implicit models for this task in terms of reconstruction accuracy and completeness.
Our method reconstructs scenes at the level of objects from RGB-D videos on the fly. We leverage 3D shape priors from a pre-computed object library to enhance accuracy and completeness of geometry reconstruction for individual objects.
8.1.6 Detecting Looted Archaeological Sites from Satellite Image Time Series
Participants: Elliot Vincent, Mehraïl Saroufim, Jonathan Chemla, Yves Ubelmann, Philippe Marquis, Jean Ponce, Mathieu Aubry.
Archaeological sites are the physical remains of past human activity and one of the main sources of information about past societies and cultures. However, they are also the target of malevolent human actions, especially in countries having experienced inner turmoil and conflicts. Because monitoring these sites from space is a key step towards their preservation, we introduce the DAFA Looted Sites dataset 32, a labeled multi-temporal remote sensing dataset containing 55,480 images acquired monthly over 8 years across 675 Afghan archaeological sites, including 135 sites looted during the acquisition period. It is particularly challenging because of the limited number of training samples, the class imbalance, the weak binary annotations only available at the level of the time series, and the subtlety of relevant changes coupled with important irrelevant ones over a long time period. It is also an interesting playground to assess the performance of satellite image time series (SITS) classification methods on a real and important use case. We evaluate a large set of baselines, outline the substantial benefits of using foundation models and show the additional boost that can be provided by using complete time series instead of using a single image.
8.1.7 Towards Zero-Shot Multimodal Machine Translation
Participants: Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden.
Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work 24, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
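The inference-time trade-off mentioned above can be illustrated with the generic classifier-free guidance rule, which interpolates (or extrapolates) between text-only and image-conditioned next-token scores. The sketch below is a minimal illustration under that assumption; the function names, the toy three-token vocabulary and the guidance scale are hypothetical, and the exact formulation used by ZeroMMT may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def guided_log_scores(text_only_logits, multimodal_logits, guidance_scale):
    """Generic classifier-free guidance on next-token log-probabilities.

    guidance_scale = 0 keeps the text-only scores, 1 keeps the
    image-conditioned scores, and values above 1 push further in the
    direction suggested by the image.
    """
    lp_text = log_softmax(text_only_logits)
    lp_mm = log_softmax(multimodal_logits)
    return [t + guidance_scale * (m - t) for t, m in zip(lp_text, lp_mm)]

# Hypothetical 3-token vocabulary: the image shifts mass toward token 2.
text_only = [2.0, 1.0, 0.5]
with_image = [1.0, 1.0, 2.5]
scores = guided_log_scores(text_only, with_image, guidance_scale=2.0)
best_token = max(range(len(scores)), key=scores.__getitem__)
```

A larger guidance scale gives the image more influence on the chosen token (better disambiguation), while a smaller one stays closer to the text-only translation model (better fidelity).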
8.1.8 mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Participants: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot.
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like data only, medium-scale, or fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR 25, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.
Example of a French document from mOSCAR.
8.1.9 Dual Perspectives on Non-Contrastive Self-Supervised Learning
Participants: Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel.
The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following Tian et al. 2021, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings 53 are illustrated by empirical experiments with real and synthetic data.
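For readers unfamiliar with the exponential moving average procedure, the sketch below shows the usual target-parameter update in its simplest form. It is a generic illustration with hypothetical names and a hypothetical decay value, not code from the paper; in practice the target branch is also treated as a constant when differentiating the loss (the stop-gradient part), which plain Python cannot show.

```python
def ema_update(target, online, decay=0.99):
    """Move each target parameter a small step toward the online parameter.

    `target` and `online` are flat lists of floats standing in for the
    weights of the two network branches.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

# With the online weights held fixed, the target weights drift toward them.
target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(500):
    target = ema_update(target, online, decay=0.99)
```

The decay controls how slowly the target branch follows the online branch: values close to 1 give a slowly varying target.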
8.1.10 Optimal transport unlocks end-to-end learning for single-molecule localization
Participants: Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal.
Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation 36, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available here.
8.2 Learning embodied representations
8.2.1 NextBestPath: Efficient 3D Mapping of Unseen Environments
Participants: Shiyao Li, Antoine Guédon, Clémentin Boittiaux, Shizhe Chen, Vincent Lepetit.
This paper 27 addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset, AiMDoom, with a map generator for the Doom video game, enabling better benchmarking of active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.
Overview of the proposed next-best-path (NBP) framework. The model (left) predicts a value map of coverage gain and an obstacle map, which are used for decision making (right) to obtain a next-best path.
8.2.2 FOM-Nav: Frontier-Object Maps for Object Goal Navigation
Participants: Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid.
This paper 42 addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in the navigation efficiency metric SPL, and yields promising results on a real robot.
The proposed frontier-object map is a rich representation of objects and frontiers (boundaries of the explored scene), displayed here as colored point clouds and red lines. It encodes geometric, distance and visual/textual information for frontiers and objects.
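For reference, SPL (Success weighted by Path Length) is commonly defined as the average over episodes of the success indicator multiplied by the ratio between the shortest-path length and the longer of the shortest-path and executed-path lengths. The sketch below follows that standard definition; the function and argument names are illustrative and not taken from the FOM-Nav code, and shortest-path lengths are assumed to be strictly positive.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        1 if the episode reached the goal, else 0
    shortest_lengths: geodesic distance from start to goal (assumed > 0)
    path_lengths:     length of the path actually travelled by the agent
    """
    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        total += success * shortest / max(travelled, shortest)
    return total / len(successes)

# Three hypothetical episodes: optimal success, success with a detour
# twice as long as needed, and a failure.
score = spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
```

A successful episode only earns full credit when the travelled path is as short as the shortest path, which is why the metric reflects navigation efficiency rather than success alone.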
8.2.3 Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
Participants: Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid.
Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work 43, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.
Gondola leverages multi-view images for 3D scene perception and segmentation masks to provide precisely grounded plans.
8.2.4 Collision avoidance from monocular vision trained with novel view synthesis
Participants: Valentin Tordjman--Levavasseur, Stéphane Caron.
Collision avoidance can be checked in explicit environment models such as elevation maps or occupancy grids, yet integrating such models with a locomotion policy requires accurate state estimation. In 58, we consider the question of collision avoidance from an implicit environment model. We use monocular RGB images as inputs and train a collision-avoidance policy from photorealistic images generated by 2D Gaussian splatting. We evaluate the resulting pipeline in real-world experiments under velocity commands that bring the robot on an intercept course with obstacles. Our results suggest that RGB images can be enough to make collision-avoidance decisions, both in the room where training data was collected and in out-of-distribution environments.
Effect of the vision-based collision-avoidance policy when the commanded velocity prompts the robot to collide with a wall. Blue: joystick user input, kept stationary at full forward throttle. Green: trajectory actually followed by the robot after compensation by the policy.
8.2.5 KernelSOS for Global Sampling-Based Optimal Control and Estimation via Semidefinite Programming
Participants: Antoine Groudiev, Fabian Schramm, Eloïse Berthier, Justin Carpentier, Frederike Dümbgen.
Global optimization has gained traction over the past decades, thanks to the development of both theoretical foundations and efficient numerical routines to cope with optimization problems of various complexities. Among recent methods, Kernel Sum of Squares (KernelSOS) appears as a powerful framework, combining the potential of sum of squares methods from the polynomial optimization community with the expressivity of kernel methods widely used in machine learning. This paper 46 applies the kernel sum of squares framework to control and estimation problems, which exhibit poor local minima. We demonstrate that KernelSOS performs well on a selection of problems from both domains. In particular, we show that KernelSOS is competitive with other sum of squares approaches on estimation problems, while being applicable to non-polynomial and non-parametric formulations. The sample-based nature of KernelSOS allows us to apply it to trajectory optimization problems with an integrated simulator treated as a black box, both as a standalone method and as a powerful initialization method for local solvers, facilitating the discovery of better solutions.
8.2.6 Sobolev Diffusion Policy
Participants: Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier.
This paper 48 introduces a novel framework to combine the strengths of policy learning and trajectory optimization effectively. On the one hand, it builds upon diffusion policy, an expressive imitation learning method based on diffusion probabilistic generative models. On the other hand, it uses gradient-based trajectory optimization solvers to generate locally optimal trajectories and leverage their associated feedback gains, doing Sobolev training with first-order information. Combining both, we introduce a first-order loss for diffusion-based policies. The framework alternates between collecting trajectories using a solver warm-started by the policy and training. Through comprehensive experiments, we demonstrate how the Sobolev component significantly reduces the number of trajectories required for the policy to converge globally. First-order information both avoids overfitting, despite using very few samples, and mitigates the compounding error issue of imitation-based policies, even when predicting torques for tasks requiring high-frequency control. We benchmark the benefits of Sobolev Diffusion Policy (SDP) on various robotics tasks of increasing complexity. In particular, SDP proves stable over extended horizons with fewer diffusion steps, shrinking the overall rollout time compared to vanilla diffusion models. When used to compute initial guesses for trajectory optimization, it reduces the solving time by a factor of 2 to 20.
A task that involves moving the arm's end-effector from the blue sphere to the red one. The pink trajectory is obtained by a trajectory optimization (TO) solver alone, the orange one by our SDP method, and the gray one is the SDP trajectory refined by the solver. SDP finds more direct trajectories, while the TO solver may get stuck in local minima.
8.2.7 First-order Sobolev Reinforcement Learning
Participants: Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier.
This paper 55 proposes a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
Comparison of Q-function slices.
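One way to write the first-order TD matching idea described above is given below. The notation is ours and only schematic: f denotes the differentiable dynamics, \pi the policy, Q_{\bar\theta} a target critic, and \lambda_s, \lambda_a hypothetical weights on the gradient terms; the exact loss used in the paper may differ.

```latex
\begin{aligned}
y(s,a) &= r(s,a) + \gamma\, Q_{\bar\theta}\big(s', \pi(s')\big), \qquad s' = f(s,a),\\
\mathcal{L}(\theta) &= \big(Q_\theta(s,a) - y(s,a)\big)^2
 + \lambda_s \big\|\nabla_s Q_\theta(s,a) - \nabla_s y(s,a)\big\|^2
 + \lambda_a \big\|\nabla_a Q_\theta(s,a) - \nabla_a y(s,a)\big\|^2 .
\end{aligned}
```

The first term is the usual TD error; the two additional terms ask the critic to match the derivatives of the Bellman target, which are obtained by differentiating y through f.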
8.2.8 Control of Humanoid Robots with Parallel Mechanisms using Differential Actuation Models
Participants: Victor Lutz, Ludovic de Matteis, Virgile Batto, Nicolas Mansard.
Several recently released humanoid robots, inspired by the mechanical design of Cassie, employ actuator configurations in which the motors are displaced from the joints to reduce leg inertia. While studies accounting for the full kinematic complexity have demonstrated the benefits of these designs, the associated loop-closure constraints greatly increase computational cost and limit their use in control and learning. As a result, the non-linear transmission is often approximated by a constant reduction ratio, preventing exploitation of the mechanism’s full capabilities. This paper 50 introduces a compact analytical formulation for the two standard knee and ankle mechanisms that captures the exact non-linear transmission while remaining computationally efficient. The model is fully differentiable up to second order with a minimal formulation, enabling low-cost evaluation of dynamic derivatives for trajectory optimization and of the apparent transmission impedance for reinforcement learning. We integrate this formulation into trajectory optimization and locomotion policy learning, and compare it against simplified constant-ratio approaches. Hardware experiments demonstrate improved accuracy and robustness, showing that the proposed method provides a practical means to incorporate parallel actuation into modern control algorithms.
8.2.9 On the Conic Complementarity of Planar Contacts
Participants: Yann de Mont-Marin, Louis Montaut, Jean Ponce, Martial Hebert, Justin Carpentier.
We present a unifying theoretical result 30 that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept used in, for instance, optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the punctual Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes (sticking, separating, and tilting) within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of advanced control and optimization algorithms in locomotion and manipulation.
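As a reminder of the two classical ingredients being connected, the point-contact Signorini law and the center of pressure are usually written as follows. The symbols are ours: \lambda_n is the normal contact force, \phi the separation (gap) between the bodies, \sigma_n the normal pressure over the contact surface S. The planar conic formulation introduced in the paper is not reproduced here.

```latex
0 \le \lambda_n \;\perp\; \phi \ge 0
\quad\Longleftrightarrow\quad
\lambda_n \ge 0,\;\; \phi \ge 0,\;\; \lambda_n\,\phi = 0,
\qquad
p_{\mathrm{CoP}} = \frac{\int_{S} x\,\sigma_n(x)\,\mathrm{d}S}{\int_{S} \sigma_n(x)\,\mathrm{d}S}.
```

The complementarity condition states that a contact force can only be nonzero when the gap is closed, and the center of pressure is the pressure-weighted average of the contact points.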
8.2.10 Reference-Free Sampling-Based Model Predictive Control
Participants: Fabian Schramm, Pierre Fabre, Nicolas Perrin-Gilbert, Justin Carpentier.
This paper 35 presents a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a dual-space spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency allows us to achieve real-time control on standard CPU hardware, eliminating the need for GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating various emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a humanoid robot, all without requiring reference tracking or offline pre-training.
Overview of the framework showing the dual-spline parametrization, noise schedule and reference-free costs.
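The core MPPI step that the framework builds on can be sketched as an importance-weighted average of sampled parameter vectors, where lower-cost samples receive exponentially larger weights. The code below is a generic illustration of that step only; the names and the temperature value are hypothetical, and the dual-space spline parameterization, noise schedule and cost terms of the paper are not reproduced.

```python
import math

def mppi_weights(costs, temperature=1.0):
    """Softmin weights over sampled trajectories (lower cost, higher weight)."""
    best = min(costs)  # subtracting the best cost avoids numerical underflow
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def mppi_update(samples, costs, temperature=1.0):
    """Weighted average of sampled parameter vectors.

    `samples` is a list of flat parameter vectors (for instance spline
    control points), one per sampled trajectory, and `costs` holds the
    rollout cost of each sample.
    """
    weights = mppi_weights(costs, temperature)
    dim = len(samples[0])
    return [sum(w * s[i] for w, s in zip(weights, samples)) for i in range(dim)]

# Two hypothetical samples with equal cost are simply averaged.
new_params = mppi_update([[0.0, 0.0], [2.0, 4.0]], [1.0, 1.0])
```

A small temperature concentrates the update on the best samples, while a large one approaches a plain average of all samples.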
8.2.11 Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
Participants: Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier.
Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP) 34, which couples a multistep flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.
Overview of the Guided Flow Policy framework.
8.2.12 Contact-Implicit Inverse Dynamics
Participants: Etienne Ménager, Pierre Fabre, Antoine Bambade, Wilson Jallet, Alberto De Marchi, Justin Carpentier.
Task-space inverse dynamics, also known as operational space control, is a popular control paradigm for controlling robots in real-time. It enables the control or stabilization of robot dynamics around reference trajectories while accounting for under-actuation, actuator limits, and contact interactions. Over the past few decades, this versatile control paradigm has been successfully deployed in numerous robotics settings, ranging from quadrupeds and humanoid robots to deformable robots, in scenarios involving rich physical contact interactions between a robot and its environment. In practice, contact-aware inverse dynamics controllers assume that contact sequences are known in advance, typically provided by a higher-level contact planner, which inherently limits their ability to select among breaking, sliding, or sticking contacts automatically.
In this paper 51, we extend the control formalism of task-space inverse dynamics, which is classically formulated as a quadratic program, to a more general quadratic program with complementarity constraints (QPCC). This formulation fully accounts for actuator limits and frictional contacts, modeled as nonlinear complementarity constraints. To solve these QPCC problems, we draw inspiration from the alternating direction method of multipliers to devise an iterative optimization approach that alternates between minimizing a smooth convex function that accounts for task objectives and system dynamics, and projecting onto convex and non-convex sets that capture actuator and frictional contact complementarity constraints. By handling frictional contact complementarity constraints through projection, our approach enables us to implicitly and automatically reason about the optimal contact modes that fulfill the task objectives and constraints. We have implemented our QPCC solver in C++ for efficiency, and demonstrate its usability and versatility on rigid and soft robots across various control scenarios, ranging from the control of an actuated box sliding on the ground, to the balance control of legged robots that automatically break and create contacts (e.g., jumping tasks, balancing tasks), to the control of deformable robots interacting with their environment.
The proposed method generically handles inverse dynamics with frictional contact for both rigid and soft robots.
8.3 A Data-driven Contact Estimation Method for Wheeled-Biped Robots
8.3.1 End-to-End and Highly-Efficient Differentiable Simulation for Robotics
Participants: Quentin Le Lidec, Louis Montaut, Yann de Mont-Marin, Justin Carpentier.
Over the past few years, robotics simulators have largely improved in efficiency and scalability, enabling them to generate years of simulated data in a few hours. Yet, efficiently and accurately computing the simulation derivatives remains an open challenge, with potentially high gains in the convergence speed of reinforcement learning and trajectory optimization algorithms, especially for problems involving physical contact interactions. This paper 49 contributes to this objective by introducing a unified and efficient algorithmic solution for computing the analytical derivatives of robotic simulators. The approach considers both the collision and frictional stages, accounting for their intrinsic nonsmoothness and also exploiting the sparsity induced by the underlying multibody systems. These derivatives have been implemented in C++, and the code will be open-sourced in the Simple simulator. They achieve state-of-the-art timings ranging from 5 µs for a 7-dof manipulator up to 95 µs for a 36-dof humanoid, outperforming alternative solutions by a factor of at least 100.
Illustration of the sliding mode.
8.3.2 Differentiable Simulation of Soft Robots with Frictional Contacts
Participants: Etienne Ménager, Louis Montaut, Quentin Le Lidec, Justin Carpentier.
In recent years, soft robotics simulators have evolved to offer various functionalities, including the simulation of different material types (e.g., elastic, hyper-elastic) and actuation methods (e.g., pneumatic, cable-driven, servomotor). These simulators also provide tools for various tasks, such as calibration, design, and control. However, efficiently and accurately computing derivatives within these simulators remains a challenge, particularly in the presence of physical contact interactions. Incorporating these derivatives can, for instance, significantly improve the convergence speed of control methods like reinforcement learning and trajectory optimization, enable gradient-based techniques for design, or facilitate end-to-end machine-learning approaches for model reduction. This paper 29 addresses these challenges by introducing a unified method for computing the derivatives of mechanical equations within the finite element method framework, including contact interactions modeled as a nonlinear complementarity problem. The proposed approach handles both collision and friction phases, accounts for their nonsmooth dynamics, and leverages the sparsity introduced by mesh-based models. Its effectiveness is demonstrated through several examples of controlling and calibrating soft systems.
8.3.3 Constrained Articulated Body Algorithms for Closed-Loop Mechanisms
Participants: Ajay Suresha Sathya, Justin Carpentier.
Efficient rigid-body dynamics algorithms are instrumental in enabling high-frequency dynamics evaluation for resource-intensive applications (e.g., model predictive control, large-scale simulation, reinforcement learning), potentially on resource-constrained hardware. Existing recursive algorithms with low computational complexity are mostly restricted to kinematic trees with external contact constraints or are sensitive to singular cases (e.g., linearly dependent constraints and kinematic singularities), severely impacting their practical usage in existing simulators. This article 54 introduces two original low-complexity recursive algorithms, loop-constrained articulated body algorithm (LCABA) and proxBBO, based on a proximal dynamics formulation for forward simulation of mechanisms with loops. These algorithms are derived from first principles using non-serial dynamic programming, exhibit linear complexity in practical scenarios, and are numerically robust to singular cases. They extend the existing constrained articulated body algorithm (constrainedABA) to handle internal loops and the pioneering BBO algorithm from the 1980s to singular cases. Both algorithms have been implemented by leveraging the open-source Pinocchio library, benchmarked in detail, and achieve state-of-the-art performance for various robot topologies, including over 6x speed-ups compared to existing non-recursive algorithms for high degree-of-freedom systems with internal loops such as recent humanoid robots.
8.3.4 A Data-driven Contact Estimation Method for Wheeled-Biped Robots
Participants: Ü. Bora Gökbakan, Frederike Dümbgen, Stéphane Caron.
Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. In this work 45, we design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.
Robustly detecting the moments when a wheeled-biped robot makes and breaks contact is crucial for successful estimation and control. This paper proposes a contact estimator based only on inertial and torque measurements. The measurements are fed into a novel Bayesian filter formulation to robustly estimate the binary contact state. We validate our results extensively both in simulation and real-world experiments, as depicted in the bottom figure.
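The predict/update structure of a Bayes filter over a binary contact state can be sketched as follows. This is only a minimal illustration, not the estimator of the paper: the fixed transition probabilities and the hand-supplied likelihood pairs are assumptions standing in for the inertial-based prediction and the learned torque-based update, and all function names are hypothetical.

```python
def predict(p_contact, p_stay=0.95, p_make=0.05):
    """Prediction step: push the contact belief through a two-state Markov chain.

    p_stay and p_make are assumed constants; the paper derives its prediction
    from inertial measurements instead.
    """
    return p_stay * p_contact + p_make * (1.0 - p_contact)


def update(p_contact, lik_contact, lik_no_contact):
    """Update step: Bayes rule with the likelihood of the current measurement
    under each hypothesis (learned from torque data in the paper, given here)."""
    numerator = lik_contact * p_contact
    return numerator / (numerator + lik_no_contact * (1.0 - p_contact))


def run_filter(likelihood_pairs, prior=0.5):
    """Run the filter over a sequence of (lik_contact, lik_no_contact) pairs
    and return the belief P(contact) after each measurement."""
    belief, history = prior, []
    for lik_contact, lik_no_contact in likelihood_pairs:
        belief = predict(belief)
        belief = update(belief, lik_contact, lik_no_contact)
        history.append(belief)
    return history


# Five consecutive measurements that each favor "contact".
beliefs = run_filter([(0.9, 0.1)] * 5)
```

Thresholding the final belief (for instance at 0.5) would give the binary contact decision; the threshold is likewise an assumption of this sketch.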
8.3.5 Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
Participants: Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid.
This paper 52 addresses the problem of reliable failure detection and recovery in robotic manipulation. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection, as illustrated in Fig. 18. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
Illustration of our Guardian model - a VLM fine-tuned on our constructed failure datasets. It detects planning failures (top) and execution failures (bottom) in robotic manipulation.
8.3.6 Augmented Lagrangian methods for infeasible convex optimization problems and diverging proximal-point algorithms
Participants: Roland Andrews, Justin Carpentier, Adrien Taylor.
This work investigates the convergence behavior of augmented Lagrangian methods (ALMs) when applied to convex optimization problems that may be infeasible. ALMs are a popular class of algorithms for solving constrained optimization problems. We establish progressively stronger convergence results, ranging from basic sequence convergence to precise convergence rates, under a hierarchy of assumptions. In particular, we demonstrate that, under mild assumptions, the sequences of iterates generated by ALMs converge to solutions of the “closest feasible problem”.
This study leverages the classical relationship between ALMs and the proximal-point algorithm applied to the dual problem. A key technical contribution is a set of concise results on the behavior of the proximal-point algorithm when applied to functions that may not have minimizers. These results pertain to its convergence in terms of its subgradients and of the values of the convex conjugate.
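The behavior described above can be observed on a toy problem. The sketch below is an assumption-laden illustration, not code from the paper: it applies a textbook augmented Lagrangian iteration to minimizing 0.5 (x - c)^2 over a scalar x subject to equality constraints a_i x = b_i. With the contradictory constraints x = 1 and x = -1, the least-violating point is x = 0; the primal iterate approaches it while the multipliers keep growing.

```python
def alm_scalar(c, a, b, rho=1.0, iters=50):
    """Augmented Lagrangian method for  min 0.5*(x - c)^2  s.t.  a[i]*x = b[i].

    Toy setting chosen for illustration; the x-update is the closed-form
    minimizer of the augmented Lagrangian in x.
    """
    lam = [0.0] * len(a)
    x = 0.0
    for _ in range(iters):
        # Stationarity: (x - c) + sum_i a_i*lam_i + rho * sum_i a_i*(a_i*x - b_i) = 0
        num = c - sum(ai * li for ai, li in zip(a, lam)) \
            + rho * sum(ai * bi for ai, bi in zip(a, b))
        den = 1.0 + rho * sum(ai * ai for ai in a)
        x = num / den
        # Multiplier update along the constraint residual.
        lam = [li + rho * (ai * x - bi) for li, ai, bi in zip(lam, a, b)]
    return x, lam


# Infeasible: x = 1 and x = -1 cannot both hold.
x_inf, lam_inf = alm_scalar(3.0, [1.0, 1.0], [1.0, -1.0])
# Feasible reference: single constraint x = 1.
x_feas, lam_feas = alm_scalar(3.0, [1.0], [1.0])
```

In the infeasible case the multipliers drift by roughly the minimal constraint residual at every iteration, which is the divergence of the dual proximal-point sequence that the text refers to.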
8.3.7 Certifiably optimal rotation and pose estimation based on the Cayley map
Participants: Timothy D Barfoot, Connor Holmes, Frederike Dümbgen.
We present novel, convex relaxations for rotation and pose estimation problems that can a posteriori guarantee global optimality for practical measurement noise levels. Some such relaxations exist in the literature for specific problem setups that assume the matrix von Mises-Fisher distribution (a.k.a., matrix Langevin distribution or chordal distance) for isotropic rotational uncertainty. However, another common way to represent uncertainty for rotations and poses is to define anisotropic noise in the associated Lie algebra. Starting from a noise model based on the Cayley map, we define our estimation problems, convert them to Quadratically Constrained Quadratic Programs (QCQPs), then relax them to Semidefinite Programs (SDPs), which can be solved using standard interior-point optimization methods; global optimality follows from Lagrangian strong duality. We first show how to carry out basic rotation and pose averaging. We then turn to the more complex problem of trajectory estimation, which involves many pose variables with both individual and inter-pose measurements (or motion priors). Our contribution 12 is to formulate SDP relaxations for all these problems based on the Cayley map (including the identification of redundant constraints) and to show them working in practical settings. We hope our results can add to the catalogue of useful estimation problems whose solutions can be a posteriori guaranteed to be globally optimal.
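For readers unfamiliar with the QCQP-to-SDP step, the standard (Shor) relaxation can be written generically as below. The symbols Q, A_i, and b_i are placeholders chosen for this sketch and are not the notation of the paper; the specific cost and constraint matrices arising from the Cayley-map noise model are not reproduced here.

```latex
\begin{align*}
\text{(QCQP)}\quad & \min_{x \in \mathbb{R}^n} \; x^\top Q x
  \quad \text{s.t.}\quad x^\top A_i x = b_i,\quad i = 1,\dots,m, \\
\text{(SDP)}\quad & \min_{X \in \mathbb{S}^n} \; \operatorname{tr}(Q X)
  \quad \text{s.t.}\quad \operatorname{tr}(A_i X) = b_i,\quad i = 1,\dots,m,
  \quad X \succeq 0.
\end{align*}
```

The SDP is obtained by substituting X = x x^T and dropping the nonconvex constraint rank(X) = 1, so its optimal value lower-bounds that of the QCQP. If the computed SDP solution happens to have rank one, it factors as X = x x^T and the recovered x is a global minimizer of the QCQP, which is the sense in which optimality is certified a posteriori.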
8.3.8 Human-Robot Co-Simulation method for upper limb assistive force calculation using polytopes
Participants: Maël Gallois, Mégane Millan, Nicolas Vignais, Sylvain Guégan, Marie Babel, Justin Carpentier, Charles Pontonnier.
Assisting the upper limb constitutes a significant challenge in the rehabilitation and readaptation of individuals with neuromuscular and/or neurodegenerative disorders. To address this issue, robotic devices such as exoskeletons have been designed. However, the control of these devices remains intricate and challenging, particularly in the context of the upper limb. The objective of this study 13 is to define a method to compute the assistance an exoskeleton should provide to the user, according to its capabilities. To minimize the user's voluntary effort, we compute optimal assisting forces that minimize the torque the user must exert along a movement while maximizing the forces provided at the user interfaces. Polytopes (Skiric 2023) are used to determine feasible sets.
8.3.9 PROXQP: an Efficient and Versatile Quadratic Programming Solver for Real-Time Robotics Applications and Beyond
Participants: Antoine Bambade, Fabian Schramm, Sarah El Kazdadi, Stéphane Caron, Adrien Taylor, Justin Carpentier.
Convex quadratic programming (QP) has become a core component in the modern engineering toolkit, particularly in robotics, where QP problems are legion, ranging from real-time whole-body controllers to planning and estimation algorithms. Many of those QPs need to be solved at high frequency. Meeting timing requirements requires taking advantage of as many structural properties as possible for the problem at hand. For instance, it is generally crucial to resort to warm-starting to exploit the resemblance of consecutive control iterations. While a large range of off-the-shelf QP solvers is available, only a few are suited to exploit problem structure and warm-starting capacities adequately. In this work 11, we propose the PROXQP algorithm, a new and efficient QP solver that exploits QP structures by leveraging primal-dual augmented Lagrangian techniques. For convex QPs, PROXQP features a global convergence guarantee to the closest feasible QP, an essential property for safe closed-loop control. We illustrate its practical performance on various standard robotic and control experiments, including a real-world closed-loop model predictive control application. While originally tailored for robotics applications, we show that PROXQP also performs at the level of the state of the art on generic QP problems, making PROXQP suitable for use as an off-the-shelf solver for regular applications beyond robotics.
8.3.10 Optimal Control of Walkers with Parallel Actuation
Participants: Ludovic de Matteis, Virgile Batto, Justin Carpentier, Nicolas Mansard.
Legged robots with complex kinematic architectures, such as parallel linkages, offer significant advancements in mobility and efficiency. However, generating versatile movements for these robots requires accurate dynamic modeling that reflects their specific mechanical structures. Previous approaches often relied on simplified models, resulting in sub-optimal control, particularly in tasks requiring the full actuator range. Here 28, we present a method that fully models the dynamics of legged robots with parallel linkages, formulating their motion generation as an optimal control problem with specific contact dynamics. We introduce 6D kinematic closure constraints and derive their analytical derivatives, enabling the solver to exploit nonlinear transmission and the consequent variable actuator reduction. This approach reduces peak motor torques and expands the usable range of actuator motion and force. We empirically demonstrate that fully modeling the kinematics leads to superior performance, especially in demanding tasks such as fast walking and stair climbing. Beyond serial-parallel designs, our method also addresses motion generation for fully-parallel walkers.
8.3.11 Extended URDF: Accounting for parallel mechanism in robot description
Participants: Virgile Batto, Ludovic de Matteis, Nicolas Mansard.
Robotic designs have played an important role in recent advances by providing powerful robots with complex mechanics. Many recent systems rely on parallel actuation to provide lighter limbs and allow more complex motion. However, these emerging architectures fall outside the scope of the most widely used description formats, leading to difficulties when designing, storing, and sharing the models of these systems. This paper 18 introduces an extension to the widely used Unified Robot Description Format (URDF) to support closed-loop kinematic structures. Our approach relies on augmenting URDF with minimal additional information to allow more efficient modeling of complex robotic systems while maintaining compatibility with existing design and simulation frameworks. This method sets the basic requirement for a description format to handle parallel mechanisms efficiently. We demonstrate the applicability of our approach by providing an open-source collection of parallel robots, along with tools for generating and parsing this extended description format. The proposed extension simplifies robot modeling, reduces redundancy, and improves usability for advanced robotic applications.
8.3.12 PROXDDP: Proximal Constrained Trajectory Optimization
Participants: Wilson Jallet, Antoine Bambade, Etienne Arlaud, Sarah El Kazdadi, Nicolas Mansard, Justin Carpentier.
Trajectory optimization has been a popular choice for motion generation and control in robotics for at least a decade. Several numerical approaches have exhibited the required speed to enable online computation of trajectories for real-time control of various systems, including complex robots. Many of these are based on the differential dynamic programming (DDP) algorithm – initially designed for unconstrained trajectory optimization problems – and its variants, which are relatively easy to implement and provide good runtime performance. However, several problems in robot control call for using constrained formulations (e.g. torque limits, obstacle avoidance), from which several difficulties arise when trying to adapt DDP-type methods: numerical stability, computational efficiency, and constraint satisfaction. In this article 14, we leverage proximal methods for constrained optimization and introduce a DDP-type method for fast, constrained trajectory optimization suited for model-predictive control (MPC) applications with easy warm-starting. Compared to earlier solvers, our approach effectively manages hard constraints without warm-start limitations and exhibits good convergence behavior. We provide a complete implementation as part of an open-source and flexible C++ trajectory optimization library called ALIGATOR. These algorithmic contributions are validated through several trajectory planning scenarios from the robotics literature and the real-time whole-body MPC of a quadruped robot.
8.3.13 Structure-Exploiting Sequential Quadratic Programming for Model-Predictive Control
Participants: Armand Jordana, Sébastien Kleff, Avadesh Meduri, Justin Carpentier, Nicolas Mansard, Ludovic Righetti.
The promise of model-predictive control in robotics has led to extensive development of efficient numerical optimal control solvers in line with differential dynamic programming because it exploits the sparsity induced by time. In this work 15, we argue that this effervescence has hidden the fact that sparsity can be equally exploited by standard nonlinear optimization. In particular, we show how a tailored implementation of sequential quadratic programming achieves state-of-the-art model-predictive control. Then, we clarify the connections between popular algorithms from the robotics community and well-established optimization techniques. Further, the sequential quadratic program formulation naturally encompasses the constrained case, a notoriously difficult problem in the robotics community. Specifically, we show that it only requires a sparsity-exploiting implementation of a state-of-the-art quadratic programming solver. We illustrate the validity of this approach in a comparative study and experiments on a torque-controlled manipulator. To the best of our knowledge, this is the first demonstration of closed loop nonlinear model-predictive control with constraints on a real robot.
8.3.14 Modeling, Embedded Control and Design of Soft Robots using a Learned Condensed FEM Model
Participants: Tanguy Navez, Etienne Ménager, Paul Chaillou, Olivier Goury, Alexandre Kruszewski, Christian Duriez.
The Finite Element Method (FEM) is a powerful modeling tool for predicting soft robots' behavior, but its computation time can limit practical applications. In this paper 16, a learning-based approach based on condensation of the FEM model is detailed. The proposed method handles several kinds of actuators and contacts with the environment. We demonstrate that this compact model can be learned as a unified model across several designs and remains very efficient in terms of modeling since we can deduce the direct and inverse kinematics of the robot. Building upon the intuition introduced in [11], the learned model is presented as a general framework for modeling, controlling, and designing soft manipulators. First, the method's adaptability and versatility are illustrated through optimization-based control problems involving positioning and manipulation tasks with mechanical contact-based coupling. Secondly, the low memory consumption and the high prediction speed of the learned condensed model are leveraged for real-time embedded control without relying on costly online FEM simulation. Finally, the ability of the learned condensed FEM model to capture soft robot design variations and its differentiability are leveraged in calibration and design optimization applications.
8.3.15 Infinite-Horizon Value Function Approximation for Model Predictive Control
Participants: Armand Jordana, Sébastien Kleff, Arthur Haffemayer, Joaquim Ortiz-Haro, Justin Carpentier, Nicolas Mansard, Ludovic Righetti.
Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work 17, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator where the value function is conditioned to the goal and obstacle.
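The role of a value function used as terminal cost can be illustrated on a toy problem. The sketch below rests on strong simplifying assumptions that are not from the paper: a ten-state deterministic chain, a tabular value function computed by value iteration (instead of a neural network trained with trajectory optimization), and a brute-force short-horizon controller. All names are hypothetical.

```python
from itertools import product

N, GOAL, ACTIONS = 10, 9, (-1, 0, 1)


def step(s, a):
    """Deterministic chain dynamics, clipped to the state range [0, N-1]."""
    return min(max(s + a, 0), N - 1)


def stage_cost(s):
    """Unit cost everywhere except at the goal."""
    return 0.0 if s == GOAL else 1.0


def value_iteration(sweeps=100):
    """Tabular infinite-horizon cost-to-go via repeated Bellman backups."""
    V = [0.0] * N
    for _ in range(sweeps):
        V = [stage_cost(s) + min(V[step(s, a)] for a in ACTIONS) for s in range(N)]
    return V


def mpc_action(s, horizon, terminal_cost):
    """Enumerate all action sequences and return the first action of the
    cheapest one (ties are broken by enumeration order)."""
    best_cost, best_first = float("inf"), 0
    for seq in product(ACTIONS, repeat=horizon):
        x, cost = s, 0.0
        for a in seq:
            cost += stage_cost(x)
            x = step(x, a)
        cost += terminal_cost(x)
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def rollout(s, horizon, terminal_cost, steps=20):
    """Apply the receding-horizon controller in closed loop."""
    for _ in range(steps):
        s = step(s, mpc_action(s, horizon, terminal_cost))
    return s


V = value_iteration()
with_value = rollout(0, 2, lambda x: V[x])     # terminal cost = learned cost-to-go
without_value = rollout(0, 2, lambda x: 0.0)   # no terminal cost
```

With a horizon of two steps the goal is invisible from state 0, so the controller without terminal cost has no reason to move toward it, whereas the controller using V as terminal cost does. This only conveys the intuition; it is not a demonstration of the stability result stated in the text.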
8.4 Image restoration and enhancement
8.4.1 A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations
Participants: Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange.
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper 19 presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
8.4.2 Deep learning for exoplanet detection and characterization by direct imaging at high contrast
Participants: Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange.
Exoplanet imaging is a major challenge in astrophysics due to the need for high angular resolution and high contrast. We present a multi-scale statistical model 20 for the nuisance component corrupting multivariate image series at high contrast. Integrated into a learnable architecture, it leverages the physics of the problem and enables the fusion of multiple observations of the same star in a way that is optimal in terms of detection signal-to-noise ratio. Applied to data from the VLT/SPHERE instrument, the method significantly improves the detection sensitivity and the accuracy of astrometric and photometric estimation.
8.4.3 Joint statistical modeling and deep learning for exoplanet detection and characterization by direct imaging at high contrast
Participants: Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange.
The detection of exoplanets, the characterization of their atmospheres, and the study of exoplanet formation mechanisms are major current challenges in astrophysics. High-contrast direct imaging (HCI) is one of the observational techniques of choice to address these questions. However, such observations are particularly demanding due to the extreme contrast levels and angular resolution required. In addition to the use of extreme adaptive optics and coronagraphs, advances in data science have become critical for analyzing these observations and disentangling the signals of interest (exoplanets and circumstellar disks) from the strong nuisance component (speckles and noise) that corrupts the data. In this context, we will present our recent developments in deep learning applied to HCI 21, aimed at the optimal and reliable extraction of astrophysical information from multivariate observations (including spatial, temporal, spectral, and multi-epoch diversity). These approaches are based on a fine modeling of the different components contributing to the total signal and incorporate physical domain knowledge as prior information. Emphasis will be placed on (i) combining deep learning models with statistical modeling of the nuisance, (ii) leveraging large archival datasets as a valuable source of diversity for tackling the unmixing task, and (iii) jointly exploiting the spectral diversity of observations. Our methods are tailored to the specific challenges of high-contrast imaging: (i) very low signal-to-noise ratios and non-stationary noise, (ii) detection of rare events, and (iii) absence of ground truth. Using data from the VLT/SPHERE instrument, we will show that these approaches enable fine modeling and effective subtraction of the nuisance component, leading to reliable and nearly optimal estimates of the astrophysical quantities of interest. This results in significantly improved detection sensitivity and more accurate astro-photometric characterization. 
The proposed approaches are also scalable and readily applicable to large-scale surveys. Looking ahead, instruments on the next generation of thirty-meter-class telescopes will enable the exploration of the innermost environments of Sun-like stars at unprecedented contrast levels. Achieving the associated scientific goals will require addressing several data science challenges: (i) approaching the ultimate performance limits of the instruments through optimal signal extraction, (ii) capturing complex, spatially structured nuisance exhibiting strong variability, and (iii) building robust nuisance models that go beyond the limitations of angular differential imaging, particularly in the vicinity of the host star. We will discuss these challenges in light of the methodological developments presented.
8.4.4 Modèle statistique apprenable de mélange de distributions et fusion de données multivariées pour l'imagerie d'exoplanètes
Participants: Théo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange.
Exoplanet imaging is a major challenge in astrophysics due to the high star-planet contrast. This paper 22 presents a multi-scale statistical model for the nuisance component corrupting multivariate image series. Integrated into a learnable architecture, it leverages the physics of the problem and enables the fusion of multiple observations of the same star. Applied to real data, the method significantly improves the detection sensitivity and the accuracy of exoplanet position and flux estimation.
8.4.5 CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis
Participants: Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry, Loic Landrieu.
Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images 26. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts' similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods.
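The selection-and-aggregation step can be pictured with a small sketch. Everything below is an assumption made for illustration: the paper does not specify here how experts are scored or combined, so the top-k selection, the softmax weighting, and the function name are hypothetical.

```python
import math


def aggregate_experts(expert_probs, selection_scores, top_k=2):
    """Combine per-domain expert predictions for one test sample.

    expert_probs:     one class-probability list per expert.
    selection_scores: one relevance score per expert, as a selection module
                      might output (assumed, not the paper's definition).
    Keeps the top_k highest-scoring experts and averages their predictions
    with softmax weights over their scores.
    """
    ranked = sorted(range(len(expert_probs)),
                    key=lambda i: selection_scores[i], reverse=True)[:top_k]
    max_score = max(selection_scores[i] for i in ranked)
    weights = [math.exp(selection_scores[i] - max_score) for i in ranked]
    total = sum(weights)
    n_classes = len(expert_probs[0])
    return [
        sum(w / total * expert_probs[i][c] for w, i in zip(weights, ranked))
        for c in range(n_classes)
    ]


# Three experts, two classes; the third expert is judged irrelevant.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
combined = aggregate_experts(probs, [2.0, 2.0, -5.0], top_k=2)
```

Because the output is a convex combination of probability vectors, it remains a valid probability vector regardless of the scores.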
8.5 Doctoral dissertations and habilitation theses
8.5.1 Deep learning for exoplanet detection in high contrast imaging
Participants: Théo Bodrito.
The thesis 37 addresses the challenge of detecting and characterizing exoplanets through direct imaging, a technique hindered by the extreme contrast and small angular separation between stars and planets. To overcome these issues, this work introduces three hybrid approaches that combine statistical modeling with deep learning, leveraging large datasets from high-contrast imaging surveys. The deep PACO method integrates a local statistical model of the nuisance component with a convolutional neural network, improving detection and characterization performance over classical algorithms. MODEL&CO further advances this by learning a unique model across a large multi-observation dataset, enabling robust detection even in challenging conditions. ExoMILD introduces a multi-scale, multi-spectral statistical framework that exploits spatial symmetries, achieving superior sensitivity and unbiased parameter estimation. Extensive testing on semi-synthetic and real datasets demonstrates significant gains in contrast and robustness, particularly at small separations. The approaches are designed to generalize across diverse observing conditions and are well-suited for future large-scale surveys. Overall, the thesis establishes a new generation of deep learning-based tools for exoplanet imaging, enabling more sensitive and reliable exploration of planetary systems.
8.5.2 Learning dexterous manipulation from 3D hand and object interaction
Participants: Zerui Chen.
In this thesis 39, we advance the understanding of 3D hand motions and hand-object interactions in monocular videos. We show how these insights can empower robots with human-like dexterous manipulation capabilities. Our approach achieves dense 3D reconstructions of both hands and objects, capturing their fine-grained interactions while maintaining fast inference speed. We investigate how to leverage these 3D reconstruction results to transfer human manipulation skills to multi-fingered robotic hands through trajectory-guided reinforcement learning and vision-based imitation learning. By effectively connecting visual motion capture with robotic execution, our work creates new opportunities for human-robot collaboration. Our contributions are structured into four key areas: First, we propose a joint learning framework for 3D reconstruction of hands and objects using signed distance functions (SDFs). This method generates high-resolution meshes and captures detailed hand-object interactions. Second, to improve the alignment between the reconstructed 3D shape and its underlying poses, we leverage hand kinematic structures to guide SDF-based reconstruction, which helps enhance visual features and increase robustness to occlusions. Third, while SDF-based methods yield promising results, they are computationally intensive and often produce overly smooth surfaces. To address this, we introduce a novel transformer-based approach for reconstructing dense point clouds of hand-held objects, achieving high-quality 3D reconstructions with fast inference speed. Finally, although vision systems produce visually plausible 3D hand and object configurations, these configurations may not always be physically plausible, which makes them less useful for robot learning. To tackle this, we develop ViViDex and first employ reinforcement learning to refine these noisy configurations. Then, we apply imitation learning to train a unified vision-based policy from refined trajectories. 
As a result, ViViDex generates natural manipulation sequences and demonstrates superior performance across three dexterous manipulation tasks.
8.5.3 Analysis of satellite image time series for classification and change detection
Participants: Elliot Vincent.
This thesis 41 develops machine learning methods for analyzing time series of satellite images (STIS), focusing on soil classification and semantic change detection. We propose three main areas of improvement. First, we design architectures specifically tailored for STIS: DTI-TS for agricultural classification, multiUTAE for change detection, and a combination of a foundation model with a temporal attention model for detecting archaeological looting. Second, we address the lack of annotated data by developing weakly supervised methods and introducing a dataset for archaeological looting detection. Finally, we examine the impact of spatial and temporal domain shifts on model performance. The DTI-TS method aligns time series prototypes with data using spectral and temporal transformations. It excels in contexts with temporal shifts and data scarcity while maintaining good interpretability. MultiUTAE segments all images in a series simultaneously, leveraging information over a broad temporal window. This sequence-to-sequence approach outperforms methods that process images individually or in pairs across various domain shift scenarios. For archaeological looting detection, the thesis introduces DAFA-LS, a dataset of Afghan sites. The best performance is achieved by a method combining a pre-trained foundation model and an attention model. Future research directions include leveraging foundation models and multi-modality, enhancing time series by improving resolution or adding elevation data, and developing unsupervised learning and domain adaptation to mitigate the lack of annotated data.
8.5.4 Object-centric representations for sensing and planning in visually-guided robotics
Participants: Thomas Chabal.
This thesis 38 pursues the ultimate goal of developing autonomous and intelligent robotic assistants with the abilities to perceive and understand the world, and explore and act on it. Such systems would have to navigate to and interact with a variety of individual components, or objects. Working from images, this thesis specifically aims at developing novel object-centric representations and algorithms to perceive and reconstruct scenes, before planning robotic manipulation and navigation actions. It addresses three main challenges: handling failures of visual systems and partial knowledge of goals in robotic assembly tasks, efficiently acquiring accurate and complete object-level representations of scenes in an online fashion, and learning to understand the semantics of human-arranged environments to explore and search for objects with a mobile robot. We advance the field through the following three contributions. First, we study the problem of stacking objects with a robotic manipulator to reproduce an assembly specified through a single photograph. As visual systems encounter unavoidable failures in analyzing images, notably due to occlusions, the target structure is only partially known. We present an approach intertwining an abstract search for a high-level assembly plan and a physical grounding of candidate plans in the real world. Our method, deployed on a robotic manipulator, builds stable structures that match the goal assemblies, known by extracting object poses with an off-the-shelf procedure in the goal image. Beyond fixed robots with limited access to observations of their surroundings, we consider cameras that freely move in the physical world and explore online scene reconstruction at the level of objects from a stream of posed RGB-D frames. We model objects as neural implicit representations, entailing feature grids and small perceptrons, optimized per scene with differentiable rendering. 
We propose a feature grid interpolation scheme to adapt to novel views of yet unseen object parts, as well as a relocalization approach to reuse object models in novel scenes and an update procedure synthesizing views from past viewpoints to increase the completion of reconstructed objects in novel sequences. Finally, we focus on robot navigation towards objects specified as categories in unknown environments, a task requiring accurate scene understanding and efficient exploration. We introduce an online frontier-object mapping with rich visual and semantic representations of frontiers, or boundaries of the explored area, and object instances. Our navigation strategy combines a high-level goal prediction stage relying on a vision-language model endowed with learnt navigation-specific encoders and decoders and a low-level path planner that generates trajectories. Our modular framework, dubbed FOM-Nav for Frontier-Object Maps, is trained on an automatically self-collected large-scale navigation dataset in scanned environments and significantly improves exploration efficiency over prior works.
8.5.5 Learning Visuomotor Policies for Robotic Manipulation
Participants: Ricardo Garcia.
This thesis 40 focuses on the development of representations and learning algorithms to perform visually-guided robotic manipulation tasks in unstructured environments. One of the ultimate goals of robotics is to train robots that can autonomously solve a wide range of tasks in the real world based on human instructions. To make progress towards this goal, this manuscript covers three main challenges: closing the sim-to-real gap for visuomotor policies, integrating 3D point cloud representations with language instructions to improve the performance of robotic manipulation in multi-task settings, and developing a generalist language-guided visuomotor policy for robotic manipulation. We first address the challenge of sim-to-real transfer in robotic manipulation. Training policies in simulation is less time-consuming and safer than in the real world. However, the discrepancies between simulation and the real environment can limit the transferability of policies trained in simulation to the real world. This issue is known as the sim-to-real gap, and domain randomization (DR) is a known technique to address this gap. DR enables sim-to-real policy transfer by randomizing the simulation's appearance (textures, lighting, object colors, and camera viewpoints) during training. However, selecting the right randomization ranges for these parameters is not trivial. This thesis proposes a data-driven strategy to systematically select the DR parameters using multi-object localization as a proxy task. We then focus on language-guided robotic manipulation and propose PolarNet and 3D-LOTUS, two 3D point cloud-based methods to integrate visual inputs and language instructions to predict manipulation actions. Both methods use efficient point cloud encoders and multimodal transformers to combine the text instructions and the geometric information from point clouds, enabling more accurate and efficient manipulation than 2D image-based approaches. 
Both 3D-based policies outperform state-of-the-art models across various multi-task settings of the RLBench benchmark and successfully transfer to a real robot, highlighting their performance in diverse environments. The last part of this thesis focuses on developing generalist robot policies for robotic manipulation. First, we propose GemBench, a comprehensive benchmark for evaluating the generalization capabilities of such policies on a set of tasks with four levels of increasing difficulty: (1) novel object placements, (2) novel rigid objects, (3) novel articulated objects, and (4) long-horizon tasks that require sequential planning. We then propose 3D-LOTUS++, which extends our point cloud-based policy 3D-LOTUS by incorporating large language models (LLMs) for task planning and vision-language models (VLMs) for object grounding. This modular framework achieves state-of-the-art performance on this new benchmark. Through these contributions, this thesis advances the development of robust, precise, and generalist visuomotor policies for robotic manipulation.
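As an illustration of the domain randomization step described in this abstract, the sketch below draws a new scene appearance for each training episode. The parameter names, the ranges in `DEFAULT_RANGES`, and the `SceneAppearance` container are hypothetical and chosen only for illustration; the thesis selects such ranges with a data-driven proxy task, which is not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple

Range = Tuple[float, float]

# Hypothetical randomization ranges (assumptions of this sketch, not values from the thesis).
DEFAULT_RANGES: Dict[str, Range] = {
    "light_intensity": (0.5, 1.5),
    "object_hue": (0.0, 1.0),
    "camera_yaw_deg": (-15.0, 15.0),
}


@dataclass(frozen=True)
class SceneAppearance:
    texture_id: int
    light_intensity: float
    object_hue: float
    camera_yaw_deg: float


def sample_appearance(
    rng: random.Random,
    ranges: Dict[str, Range] = DEFAULT_RANGES,
    num_textures: int = 10,
) -> SceneAppearance:
    """Draw one randomized appearance, uniformly within the given ranges."""
    return SceneAppearance(
        texture_id=rng.randrange(num_textures),
        light_intensity=rng.uniform(*ranges["light_intensity"]),
        object_hue=rng.uniform(*ranges["object_hue"]),
        camera_yaw_deg=rng.uniform(*ranges["camera_yaw_deg"]),
    )


# One appearance per simulated training episode; seeded for reproducibility.
rng = random.Random(0)
episodes = [sample_appearance(rng) for _ in range(5)]
```

Widening or narrowing the entries of `DEFAULT_RANGES` is the knob whose choice the abstract describes as non-trivial: ranges that are too narrow may not cover the real scene, while ranges that are too wide can make training harder.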
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
9.1.1 Louis Vuitton/ENS chair on artificial intelligence
Participants: Jean Ponce.
The scientific chair Louis Vuitton - École normale supérieure in Artificial Intelligence was created in 2017 and inaugurated on April 12, 2018 by the ENS Director Marc Mézard and the LV CEO Michael Burke. The goal of the chair is to establish a close collaboration between LV and ENS in the area of Artificial Intelligence. The chair benefits from a generous annual contribution of 200K euros provided by LV in support of research activities in statistical learning and computer vision. In particular, the chair supports the costs of researchers, students, missions, computational resources as well as seminars and meetings, including the two-day meeting organized annually by LV and ENS. During 2020, ENS and LV organized several joint meetings with the participation of researchers from the SIERRA and WILLOW teams. The chair has also supported the hiring of one PhD student in the WILLOW team, missions to conferences and international research labs as well as data collection for research projects. In 2020, the chair was extended for a further three-year period, until 2023. We are planning to start a CIFRE PhD of François Gardères together with Louis Vuitton in 2023.
9.1.2 Casino/ENS chair on algorithmic and machine learning
Participants: Justin Carpentier.
The scientific chair Casino/ENS - École normale supérieure on algorithmic and machine learning was created in 2021. J. Carpentier is in charge of the robotics axis of this chair.
10 Partnerships and cooperations
10.1 International research visitors
10.1.1 Visits of international scientists
Other international visits to the team
Marc Toussaint
- Status: Professor
- Institution of origin: TU Berlin
- Country: Germany
- Dates: 1 month
- Context of the visit: collaboration
- Mobility program/type of mobility: research stay
Mike Tarr
- Status: Professor
- Institution of origin: Carnegie Mellon University
- Country: US
- Dates: 1 month
- Context of the visit: collaboration
- Mobility program/type of mobility: sabbatical
Baohe Zhang
- Status: PhD student
- Institution of origin: University of Freiburg
- Country: Germany
- Dates: 5 months
- Context of the visit: Collaboration on world modeling for robotics
- Mobility program/type of mobility: research stay
10.2 European initiatives
10.2.1 Horizon Europe
AGIMUS
AGIMUS project on cordis.europa.eu
- Title: Next generation of AI-powered robotics for agile production
- Duration: From October 1, 2022 to September 30, 2026
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
  - AIRBUS, France
  - KLEEMANN HELLAS SA (KLEEMANN HELLAS SA), Greece
  - PAL ROBOTICS SLU (PAL ROBOTICS), Spain
  - Q-PLAN INTERNATIONAL ADVISORS PC (Q-PLAN INTERNATIONAL), Greece
  - PAL FRANCE, France
  - THIMM OBALY, K.S., Czechia
  - CESKE VYSOKE UCENI TECHNICKE V PRAZE (CVUT), Czechia
  - CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE CNRS (CNRS), France
- Inria contact: Justin Carpentier
- Coordinator:
- Summary:
AGIMUS aims to deliver an open-source breakthrough innovation in AI-powered agile production, introducing solutions that push the limits of perception, planning, and control in robotics, enabling general-purpose robots to be quick to set up, autonomous, and able to adapt easily to changes in the manufacturing process. To achieve such agile production, AGIMUS leverages cutting-edge technologies and goes beyond the state of the art to equip current mobile manipulators with a combination of (i) an advanced task and motion planner that can learn from video demonstrations available online; (ii) optimal control policies obtained from advances in reinforcement learning based on efficient differentiable physics simulations of the manufacturing process; as well as (iii) advanced perception algorithms able to handle objects and situations unseen during initial training. Along the way, optimization of energy efficiency and the use of 5G technology will support further pushing the limits of autonomy. The AGIMUS solutions and their impact will be demonstrated and thoroughly stress-tested in 3 testing zones, as well as 3 industrial pilots in Europe, under numerous diverse real-world case studies and scenarios (different tools, environments, processes, etc.). In every step, and from the very beginning, AGIMUS will go beyond current norms and involve a wide range of stakeholders, starting from the production line itself, to identify the essential ethical-by-design principles and guidelines that can maximise acceptance and impact.
ARTIFACT
ARTIFACT project on cordis.europa.eu
- Title: The Artificial Motion Factory
- Duration: From September 1, 2025 to August 31, 2030
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Justin Carpentier
- Coordinator:
- Summary:
Today's robots are confined to tightly controlled environments: even the complex choreographies that the Atlas humanoid flawlessly executes heavily rely on handcrafted control strategies and detailed workspace models, with little place for sensing. To put it bluntly, robots are nowhere near the level of agility, dexterity, and even less so autonomy, robustness, and safety required for their deployment in the wild alongside people.
The tenet of ARTIFACT is that the key to an actual revolution will come from the algorithmic foundations of artificial motion intelligence, an AI challenged from the start to interact physically with dynamic environments and, ultimately, people. To do so, we will break away from the dichotomy between optimal control, where the role of perception is traditionally limited to an early state estimation stage, and reinforcement learning, where control policies are typically learned model-free with no guarantee to cope with the curse of dimensionality.
In ARTIFACT, we will devise a unified, structured, modular, and learnable control architecture for providing robots with advanced decision-making capabilities to solve complex tasks and face new interactions as they experience the world. It will leverage the notion of differentiable programming at all scales to enable robots to (i) capture models of their interactions directly from a sound combination of sensor data and first principles from physics, (ii) autonomously discover new complex gestures and movements leveraging their past experiences, and (iii) learn embodied representations to control their interactions finely and reason about the physical world. It will be implemented in open-source software and shown in real-world and challenging scenarios requiring fine dexterity and high agility. Altogether, these contributions will be the key enablers to enhance robot autonomy fundamentally, thus opening the age of ubiquitous robots at the service of mankind.
ExTRAORDiNary
ExTRAORDiNary project on cordis.europa.eu
- Title: Accelerating Differentiable Robot Dynamics Simulation for Advanced Control
- Duration: From May 1, 2025 to April 30, 2027
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Justin Carpentier
- Coordinator:
- Summary:
Differentiable robot dynamics simulation is a crucial enabler of advanced robot control. It is at the heart of both model predictive control (MPC) and learning-based approaches (e.g. reinforcement learning [RL]), which are among the most successful and actively researched robot control algorithms. Increased usage of the computationally demanding MPC/RL controllers has led to a growing need for efficient dynamics simulators. However, existing simulators internally use inefficient high-complexity (worst-case cubic) constrained dynamics algorithms (CDA) and are often inefficiently implemented leading to a slowdown of several factors compared to a fast simulator like Pinocchio.
Addressing these concerns, we will accelerate the differentiable simulation through three complementary strategies. We will 1) leverage low-complexity CDAs, 2) use Pinocchio's proven efficient software design patterns and explore further acceleration via code generation computations, and 3) derive efficient algorithms for differentiating through contact simulation.
Furthermore, our simulator will solve the nonlinear complementarity problem of frictional contact without making physics-compromising relaxations like existing simulators and will be publicly available as part of the widely used open-source Pinocchio library. By adding key enhancements to Pinocchio, we will make it a viable alternative to inefficient but feature-rich software simulators. The visibility, impact and usability of our simulator will be enhanced by addressing some low-hanging fruit in MPC, RL and physics identification applications.
This project's contributions will not only pave the way towards fast whole-body controllers and faster and more sustainable RL training (important at a time of surging RL research activity), but will also impact adjacent fields like biomechanics and computer graphics in the long term.
LiftMeUp
LiftMeUp project on cordis.europa.eu
- Title: LiftMeUp: Globally optimal algorithms for dexterous manipulation and locomotion
- Duration: From May 1, 2025 to April 30, 2027
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Justin Carpentier
- Coordinator:
- Summary:
Robots bear the potential to help solve the world’s pressing problems by enabling and scaling up operations beyond human capacities. To successfully manipulate objects and perform reliable locomotion, robots require adequate models and solvers. Traditionally, physics-based models and iterative solvers are used, and obtaining reliable solutions requires significant effort in model tuning and heuristics for good convergence. LiftMeUp’s objective is to combine data-driven modeling with globally optimal solvers in a unique way to create an easy-to-use framework for the life-long operation of robots in challenging tasks. The result is a transparent, sample-efficient alternative to the less interpretable and resource-hungry deep-learning solutions for robotics. Furthermore, LiftMeUp builds on providing certifiably optimal methods with important consequences for safety and efficiency, as opposed to deep learning and local solvers, where different initializations can lead to entirely different solutions.
LiftMeUp is carried out at WILLOW, Inria Paris, known for cutting-edge control and locomotion research, and has three stages: first, combining concepts from Koopman theory, polynomial optimization, and kernel methods, lifting functions are inferred from data and integrated into globally optimal methods for state estimation and control. Second, different models are optimally combined, leading to a modular framework that can be incrementally updated online. Lastly, these novel algorithms are implemented on hardware to solve real-world locomotion and dexterous manipulation tasks.
This framework will have an important scientific impact by creating novel connections between global optimization and machine learning, enabling the use of principled over heuristic solvers in a broad range of applications in robotics and beyond. It will entail energy and time savings for the economy and using sample-efficient and transparent models will democratize technology and build trust.
10.3 National initiatives
10.3.1 PRAIRIE
Participants: Justin Carpentier, Jean Ponce, Cordelia Schmid.
The Prairie Institute (PaRis AI Research InstitutE) is one of the four French Institutes for Interdisciplinary Artificial Intelligence Research (3IA), which were created as part of the national French initiative on AI announced by President Emmanuel Macron on March 29, 2018. It brings together five academic partners (CNRS, Inria, Institut Pasteur, PSL University, and University of Paris) as well as 17 industrial partners, large corporations which are major players in AI at the French, European and international levels, as well as 45 Chair holders, including three of the members of WILLOW (Carpentier, Ponce, Schmid). Ponce is the scientific director of PRAIRIE.
10.3.2 PR[AI]RIE-PSAI
Participants: Justin Carpentier, Stephane Caron, Shizhe Chen, Jean Ponce, Cordelia Schmid.
PR[AI]RIE-PSAI (Paris School of AI) is one of the 9 French AI-Clusters financed by France 2030 for a total of 75 M€ over 5 years (2024-2029). Created in 2019 by CNRS, Inria, Institut Pasteur, PSL University, Université de Paris Cité, and a club of industrial partners, PaRis Artificial Intelligence Research InstitutE (PR[AI]RIE) was one of the 4 Interdisciplinary Institutes for AI research set up as part of the national strategy for Artificial Intelligence announced by the President of the French Republic in March 2018. Starting in 2024, it has evolved to become the PR[AI]RIE Paris School of AI (PR[AI]RIE-PSAI) and to cover the triptych of research, training and innovation. PR[AI]RIE-PSAI's activities are supported by 125 internationally renowned scientists, specialists in AI with diverse fields of application, such as biology, health, physics, transport or the environment, working in collaboration with public and private actors in these sectors. It includes the five faculty members of WILLOW.
10.3.3 VideoPredict: Predicting future video content
Participants: Cordelia Schmid, Jean Ponce.
Predicting future video content is a challenging problem with high potential impact in downstream tasks such as self-driving cars and robotics, but also much promise for the learning process itself, from self-supervised learning to data augmentation. Existing approaches range from predicting future actions with semantic labels to creating realistic renderings of future frames. Most of them use straight predictions from convolutional features of previous frames. We propose instead to model the causal effects involved in the video formation process, and disentangle motion and appearance factors. This will result in better prediction, but also and maybe more importantly in a better, more structured understanding of the video content, leading to explainable and interpretable results, and eventually to more trustworthy learning systems. The German and French partners are, respectively, experts in machine learning and computer vision, with complementary research threads in causality and disentangled data models on the one hand, and video understanding and action recognition on the other hand, that are ideally suited for this collaborative project.
10.3.4 PEPR Organic Robotics
Participants: Justin Carpentier, Stephane Caron, Megane Millan, Umit Bora Gokbakan, Etienne Menager.
The PEPR O2R "Organic Robotics" aims to initiate a change in robotics to create a new generation of robots capable of fluid and natural interactions with users and of social adaptation in these interactions, and able to accompany the technological transitions of societies by providing adapted, responsive and reliable services to citizens. In the frame of this national program, WILLOW is involved in Structuring Action 2 (Robot motion with physical interactions and social adaptation) led by Philippe Souères at LAAS-CNRS, and Structuring Action 4 (Modelling, Simulation, Multi-scale, and Biomechanics) led by Jérémie Dequidt at Inria DEFROST. J. Carpentier is also a member of the executive committee of the PEPR.
10.3.5 ARTIFACT (ANR Tremplin): The Artificial Motion Factory
Participants: Justin Carpentier, Megane Millan, Franki Nguimatsia Tiofack.
Modern robots remain confined to tightly controlled environments, and even the complex choreographies that Boston Dynamics humanoids execute flawlessly rely heavily on motion capture, handcrafted control strategies, and detailed workspace models, which leave little room for perception. Put plainly, robots are far from reaching the level of agility and dexterity, let alone the autonomy, robustness, and safety, required for their deployment "in the wild" alongside people. A leap forward in these capabilities is needed for them to deliver on their promises and truly leave the laboratory. Our tenet is that the key to this revolution is the development of the theoretical and algorithmic foundations of a genuine artificial motion intelligence, an AI that must meet the additional challenge of interacting physically with dynamically changing environments and, ultimately, with people. We will break away from the dichotomy between optimal control, where the role of perception is traditionally limited to an early state estimation stage, and reinforcement learning, where control policies are typically learned model-free, with no guarantee of coping with the curse of dimensionality. Concretely, we will use the Koopman formalism of complex dynamical systems to learn sensorimotor models and the corresponding control strategies from sensor data. We will develop powerful methods to learn, control, and share a dictionary of sensorimotor synergies across tasks, echoing those used by the human central nervous system in everyday tasks and accelerating the acquisition of new skills. 
We will leverage the composition of sensorimotor strategies and tree search strategies powered by neural networks to optimally plan robot motions under dynamic observation constraints. The proposed framework will be implemented in new differentiable programming software architectures and demonstrated on several locomotion and manipulation tasks, both in simulation and on real robots.
10.3.6 NIMBLE (ANR JCJC): Inexact optimization for robot control
Participants: Justin Carpentier, Stephane Caron, Etienne Arlaud, Jean Ponce, Oumayma Bounou, Joris Vaillant, Wilson Jallet, Frederike Dumbgen.
The limited agility and dexterity of modern robots prevent them from being deployed outside of laboratories, let alone outside of factories. With NIMBLE, we identify the classical sense-plan-act design pattern, widely adopted in robotics, as one of the main limiting factors. We propose to replace this three-part control paradigm by learning, from real robot experiments, a predictive model of the robot's sensorimotor capabilities. This sensorimotor model will notably be exploited to make complex decisions that generalize to unforeseen situations directly from sensor measurements. While NIMBLE’s innovation takes its roots in the observation of the human motor control organization, it is grounded in advanced and principled mathematical methodologies, in particular the Koopman operator sitting on top of (deep) learning, and exploits our recognized expertise in robot modeling, optimal control and machine learning for real robots. It will notably enable complex tasks to be defined and executed directly in the sensor space. The success of NIMBLE will be assessed through clear benchmarks in quadrupedal locomotion able to optimally adapt to unstructured terrains and in mobile manipulation for opening unknown doors using a sound combination of force and visual feedback.
10.3.7 INEXACTE (ANR PRCE): Inexact optimization for robot control
Participants: Justin Carpentier, Stephane Caron, Etienne Arlaud, Antoine Bambade, Joris Vaillant, Wilson Jallet.
Robotic systems are expected to take a large place in tomorrow’s society, far beyond current industrial robots in tightly controlled factory environments, with large impacts in terms of safety, health at work, comfort and productivity. The motion of robots is typically designed and controlled by specifying numerical objectives and constraints on what they must do, and within which limits. These specifications often conflict, and the actual control must be computed to satisfy all of them in the best possible way. This is naturally achieved by solving a numerical optimization problem. Such problems are often small enough in robotics that they can be solved exactly in theory, but they are always based on models, and by definition, models reflect reality imperfectly, even more so as we get away from tightly controlled (factory) environments.
We propose a complete change of paradigm, to acknowledge that we actually solve inaccurate optimization problems that provide inaccurate solutions by construction, and explore the following two hypotheses: (H1) we can obtain the exact same performance with imprecise numerical solutions, (H2) we can obtain these imprecise numerical solutions using less costly numerical methods, which can be computed faster, using less demanding hardware. To the best of our knowledge, these questions have barely been explored and INEXACTE will provide the first comprehensive exploration of this topic.
Our short-term ambition is to significantly lower the computational requirements for solving control problems, taking advantage of the imprecisions inherent to robotics control to compute appropriate solutions faster. But ultimately, our long-term ambition is to design less fragile, less expensive and less polluting robots, since being less dependent on precise models can make us less dependent on precise and therefore complex, fragile, expensive and resource-demanding mechatronics.
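The two hypotheses (H1) and (H2) can be illustrated on a toy problem. The sketch below minimizes a small convex quadratic cost with plain gradient descent and compares a loose stopping tolerance with a tight one. The matrix `Q`, the vector `b`, the step size, and the tolerances are arbitrary values chosen for illustration, and the solver is a textbook method, not one developed in the project.

```python
from typing import List, Tuple

# Toy quadratic cost f(x) = 0.5 * x^T Q x - b^T x with a symmetric positive definite Q.
# Its exact minimizer is x* = Q^{-1} b = (0.2, 0.4).
Q = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def cost(x: List[float]) -> float:
    qx = [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]
    return 0.5 * (x[0] * qx[0] + x[1] * qx[1]) - (b[0] * x[0] + b[1] * x[1])


def gradient_descent(tol: float, step: float = 0.25, max_iters: int = 1000) -> Tuple[List[float], int]:
    """Run gradient descent from the origin until the gradient norm drops below tol."""
    x = [0.0, 0.0]
    for k in range(max_iters):
        g = [
            Q[0][0] * x[0] + Q[0][1] * x[1] - b[0],
            Q[1][0] * x[0] + Q[1][1] * x[1] - b[1],
        ]
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 <= tol:
            return x, k  # k gradient steps were taken
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x, max_iters


x_loose, iters_loose = gradient_descent(tol=1e-2)   # imprecise, cheap solution
x_tight, iters_tight = gradient_descent(tol=1e-8)   # precise, more expensive solution
cost_gap = cost(x_loose) - cost(x_tight)
```

On this example the loosely converged solution needs fewer iterations and its cost is within a small margin of the tightly converged one, which is the kind of trade-off the hypotheses refer to. Whether such a gap is negligible for a real controller depends on the task and is not something this toy example can establish.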
10.3.8 3D-GEM (ANR JCJC): Learning Generalizable 3D-based Robotic Manipulation Policies
Participants: Shizhe Chen, Justin Carpentier, Stephane Caron, Cordelia Schmid, Jean Ponce.
Robotic manipulation in unstructured environments is a long-term goal, with the potential for significant societal and economic impacts such as in manufacturing and healthcare. However, current approaches suffer from significant limitations in generalization to novel environments, objects and tasks, which are essential for real-world applications. Most learning-based methods are trained and evaluated on a narrow range of tasks - typically basic pick-and-place skills, and focus on 2D images, lacking crucial 3D understanding. The 3D-GEM project aims to develop cutting-edge robotic manipulation systems by leveraging recent breakthroughs in artificial intelligence, particularly large language models and vision foundation models, to build 3D-based robotic manipulation foundation models. This initiative will establish a modular framework to tackle critical challenges, including data scarcity, generalization, dexterity, and efficiency. The project involves three key thrusts: (1) significantly enhancing the scale and quality of robot datasets; (2) advancing 3D embodied perception and task planning for comprehending complex 3D scenes and generating high-level grounded plans; (3) learning generalist 3D motion planning policies using multimodal sensors and model predictive control. These high-level and low-level modules will function in a closed-loop system to enable efficient task execution across diverse scenarios, ultimately improving the versatility and effectiveness of robotic systems.
10.4 Regional initiatives
10.4.1 AI4IDF
Participants: Justin Carpentier, Jean Ponce, Etienne Arlaud, Joris Vaillant, Pierre-Guillaume Raverdy.
Île-de-France is home to the world's largest mathematics community, several of France's largest computer science laboratories, but also a dense industrial fabric in artificial intelligence.
In this extremely rich context, the four main Artificial Intelligence (AI) institutes - DATAIA, Hi! PARIS, PRAIRIE and SCAI - propose to create an alliance to structure and animate the community, and to offer industrial and international partners a unified vision of the exceptional forces at work.
11 Dissemination
Participants: Ajay Sathya, Cordelia Schmid, Elliot Vincent, Etienne Menager, Etienne Arlaud, Francois Garderes, Fabian Schramm, Frederike Dumbgen, Franki Nguimatsia Tiofack, Gabriel Fiastre, Louis Montaut, Justin Carpentier, Jean Ponce, Shizhe Chen, Shiyao Li, Sara Pieri, Theotime Le Hellard, Matthieu Futeral-Peter, Paul Pacaud, Roland Andrews, Thomas Chabal, Valentin Tordjman–Levavasseur, Wilson Jallet, Umit Bora Gokbakan, Zeeshan Khan.
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
- Ellis Workshop on Comp. Vision & Machine Learning, Bad Teinach, April 2025 (C. Schmid)
- CVPR 2025 workshop on Generalization in Robotics Manipulation Workshop and Challenges, Nashville, June 2025 (S. Chen, C. Schmid)
- Video AI Symposium, Paris, September 2025 (C. Schmid)
- CoRL 2025 workshop on Open-Source Hardware in the Era of Robot Learning, Seoul, September 2025 (S. Caron)
- Workshop on Diverse Optimization and Exploration, Paris, November 2025 (J. Carpentier)
11.1.2 Scientific events: selection
Chair of conference program committees
- IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (J. Ponce, C. Schmid, S. Chen)
- International Conference on Computer Vision (ICCV) (J. Ponce, C. Schmid)
- European Conference on Computer Vision (ECCV) (C. Schmid, S. Chen)
- International Conference on Machine Learning (ICML) (C. Schmid, S. Chen)
- International Conference on Learning Representations (ICLR) (C. Schmid, S. Chen)
- Association for the Advancement of Artificial Intelligence (AAAI) (S. Chen)
- RSS Pioneers (A. Sathya, F. Dumbgen)
Member of the conference program committees
- Associate editor for the Humanoids 2025 conference (S. Caron, A. Sathya)
- Associate editor for the IROS 2025 conference (A. Sathya)
- Associate editor for the ICRA 2026 conference (A. Sathya)
Reviewer
- IEEE-RAS International Conference on Robotics and Automation (ICRA) (S. Caron, Ü. B. Gökbakan, F. N. Tiofack, T. L. Hellard, F. Schramm, A. Sathya, S. Chen, T. Chabal)
- IEEE-RAS International Conference on Humanoid Robots (Humanoids) (S. Caron)
- IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (S. Caron, Ü. B. Gökbakan, F. Schramm, S. Li, L. Montaut, S. Chen)
- International Conference on Learning Representations (ICLR) (T. L. Hellard, S. Pieri)
- Annual Meeting of the Association for Computational Linguistics (ACL) (S. Pieri)
- Conference on Neural Information Processing Systems (NeurIPS) (S. Pieri, Z. Khan)
- IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (S. Pieri, S. Li, Z. Khan)
- International Conference on 3D Vision (3DV) (S. Li)
- International Conference on Machine Learning (ICML) (Z. Khan)
- Robotics: Science and Systems (RSS) (L. Montaut)
- IEEE International Conference on Automation Science and Engineering (CASE) (A. Sathya)
- International Conference on Computational Linguistics (COLING) (S. Chen)
- Conference on Language Modeling (COLM) (S. Chen)
- Conference on Robot Learning (CoRL) (S. Chen)
- International Conference on Computer Vision (ICCV) (S. Chen, T. Chabal)
11.1.3 Journal
Member of the editorial boards
- Associate editor, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (S. Chen)
- Associate editor, IEEE Transactions on Robotics (T-RO) (J. Carpentier)
- Associate editor, IEEE Robotics and Automation Letters (RA-L) (J. Carpentier)
Reviewer - reviewing activities
- IEEE Transactions on Robotics (T-RO) (J. Carpentier, S. Caron, L. Montaut, A. Sathya, S. Chen)
- IEEE Robotics and Automation Letters (RA-L) (S. Caron, Ü. B. Gökbakan, F. Schramm, E. Menager, L. Montaut, A. Sathya, S. Chen, T. Chabal)
- International Journal of Robotics Research (IJRR) (E. Menager, A. Sathya)
- IEEE-RAS International Conference on Soft Robotics (E. Menager)
- IEEE Transactions on Automation Science and Engineering (T-ASE) (A. Sathya)
- International Journal of Computer Vision (IJCV) (S. Chen)
11.1.4 Invited talks
- J. Carpentier, Conseil Scientifique d'Inria, Paris, December 2025.
- J. Carpentier, Workshop on Diverse Optimization for Robotics, Paris, November 2025.
- J. Carpentier, 4th International Workshop on AI for Robotics, NAVER LABS Europe, Grenoble, November 2025.
- J. Carpentier, X-IA event, Paris, November 2025.
- J. Carpentier, Table ronde IA et Robotique, Cap Digital, Paris, November 2025.
- J. Carpentier, IEEE-RAS Polish Chapter, Remote, September 2025.
- J. Carpentier, PAISS Summer School, Grenoble, September 2025.
- J. Carpentier, Table ronde Robotique et IA VivaTech, Paris, June 2025.
- J. Carpentier, RobotSoft workshop, Lausanne, April 2025.
- J. Carpentier, AI Summit, February 2025.
- C. Schmid, 4th International Workshop on AI for Robotics, NAVER LABS Europe, Grenoble, November 2025.
- C. Schmid, Workshop on Multimodal Representation and Retrieval, in conjunction with ICCV’25, October 2025.
- C. Schmid, Festvortrag (ceremonial lecture) at Leopoldina annual meeting, Halle, September 2025.
- C. Schmid, Invited speaker at Video AI Symposium, Paris, September 2025.
- C. Schmid, Keynote at Building Bridge Conference, Dresden, 2025.
- C. Schmid, Invited speaker at Workshop on Pixel-level Vision Foundation Models, in conjunction with CVPR’25, June 2025.
- C. Schmid, Invited speaker at Workshop on Multimodal Algorithmic Reasoning, in conjunction with CVPR’25, June 2025.
- C. Schmid, Invited speaker at Workshop on ScanNet++ Novel View Synthesis and 3D Semantic Understanding, in conjunction with CVPR’25, June 2025.
- C. Schmid, Invited speaker at Workshop on Video Large Language Models, in conjunction with CVPR’25, June 2025.
- C. Schmid, Invited speaker at Workshop on Computer Vision in the Wild, in conjunction with CVPR’25, June 2025.
- C. Schmid, Seminar at Applied and Theoretical Aspects of Robot Intelligence Lab, TUM, Munich, December 2025.
- C. Schmid, Presentation at Workshop on Diverse Optimization and Exploration, Inria, Paris, November 2025.
- C. Schmid, Presentation at Malik Fest, University of California, Berkeley, October 2025.
- C. Schmid, Presentation at Ellis workshop, Bad Teinach, April 2025.
- J. Ponce, Forum on AI Frontiers 2025, Seoul, Korea, October 27.
- S. Caron, CoRL 2025 workshop on Open-Source Hardware in the Era of Robot Learning, September 2025.
- S. Caron, Harada Lab, The University of Osaka, Japan, October 2025.
- S. Caron, LAAS-CNRS, October 2025.
- A. Sathya, Robotics, Optimization, and Assistive Mobility (ROAM) Lab at University of Notre Dame, USA, July 2025.
- A. Sathya, EEE Department, IIT-Guwahati, Guwahati, India, August 2025.
- A. Sathya, Gepetto Team, LAAS-CNRS, Toulouse, October 2025.
- S. Chen, GDR IASIS workshop on Deformable Object Modelling Trends: from Perception to Applications, Paris, April 2025.
- S. Chen, Invited speaker at Workshop on Computer Vision in the Wild, in conjunction with CVPR’25, June 2025.
- S. Chen, Demi-heure de science, Inria Paris, July 2025.
- S. Chen, Korea University, October 2025.
- S. Chen, THOTH team, Inria Grenoble, November 2025.
- S. Chen, Workshop on Foundation Models in Robotics, Lyon, November 2025.
11.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
11.2.1 Teaching
- Course at Master MVA: Robotics (S. Caron, J. Carpentier, A. Sathya as teaching assistant)
- Course at ENS-PSL: Planification de mouvement en robotique et en animation graphique (S. Caron)
- Course at Dauphine-PSL: Formation Chef de projet IA (S. Caron)
- Lecture at Mines Paris-PSL (formation MAREVA): Reinforcement learning for robotics (S. Caron)
- Introduction to computer vision, NYU, Spring 2025 (J. Ponce)
- Introduction to computer vision, ENS-PSL, Fall 2025 (J. Ponce, G. Fiastre as teaching assistant)
- Computer vision, Chefs de projets IA, Société des Ingénieurs de l'Automobile, June 2025 (J. Ponce)
- Computer vision, Casino COMEX, We Are School, May 2025 (J. Ponce)
- Machine Learning and Applications (MALAP), École Nationale des Ponts et Chaussées, IP Paris (S. Li as teaching assistant)
- Deep Reinforcement Learning lecture at Master MVA: Deep Learning (S. Chen)
11.2.2 Supervision
PhD defenses
- Zerui Chen, advised by C. Schmid and S. Chen.
- Ricardo Garcia-Pinel, advised by C. Schmid and S. Chen.
- Thomas Chabal, advised by J. Ponce, C. Schmid and S. Chen.
- Matthieu Futeral-Peter, advised by R. Bawden (Inria ALMAnaCH), B. Sagot (Inria ALMAnaCH) and C. Schmid.
- Theo Bodrito, advised by J. Ponce and J. Mairal (Inria Grenoble).
- Elliot Vincent, advised by J. Ponce and M. Aubry (ENPC).
PhD students
- Théotime Le Hellard, started in Oct 2025, advised by J. Carpentier.
- Franki Nguimatsia Tiofack, started in Jan 2025, advised by J. Carpentier.
- Roland Andrews, started in Oct 2024, advised by J. Carpentier and A. Taylor (Sierra).
- Imen Mahdi (University of Freiburg), started in Oct 2024, advised by C. Schmid and Abhinav Valada (University of Freiburg).
- Romain Seailles, started in Sept 2024, advised by J. Ponce and J. Mairal (Inria Grenoble).
- Basile Terver (Meta), started in Nov 2024, advised by J. Ponce and Y. LeCun (Meta).
- Shiyao Li (ENPC), started in Oct 2024, advised by S. Chen and V. Lepetit (ENPC).
- Lucas Ventura, started in Oct 2022, advised by C. Schmid and G. Varol (ENPC).
- Fabian Schramm, started in Feb 2023, advised by J. Carpentier and N. Perrin-Gilbert (ISIR).
- Zeeshan Khan, started in Sept 2023, advised by S. Chen and C. Schmid.
- Gabriel Fiastre, started in Sept 2023, advised by C. Schmid.
- Ludovic de Matteis, started in Oct 2023, advised by J. Carpentier and N. Mansard (CNRS).
- U. Bora Gökbakan, started in June 2024, advised by S. Caron and P. Souères (CNRS).
- S. Pieri, started in Oct 2024, advised by S. Chen, C. Schmid and J. Sivic (Czech Technical University).
- P. Pacaud, started in Oct 2024, advised by S. Chen and C. Schmid.
- F. Gardères, started in May 2023, advised by S. Chen and J. Ponce.
- F. Porcher, started in May 2025, advised by N. Carion (Meta), K. Alahari (Inria Grenoble) and S. Chen.
11.2.3 Juries
- PhD committee of Marc Duclusaud, University of Bordeaux, France, December 2025 (S. Caron)
- CRCN recruitment jury for Inria Nancy, April 2025 (S. Caron)
- Timothée Darcet, Université Grenoble Alpes, July 2025 (C. Schmid)
- Théo Cachet, Sorbonne université, Paris, June 2025 (C. Schmid)
- Kumar Ashutosh, prelim UT Austin, May 2025 (C. Schmid)
- Corentin Sautier, ENPC, October 2025 (S. Chen)
- Ivan Lopes, Mines Paris – PSL, October 2025 (S. Chen)
- Smail Ait Bouhsain, LAAS-CNRS, April 2025 (J. Carpentier)
11.3 Popularization
11.3.1 Specific official responsibilities in science outreach structures
- "Carte blanche on AI" in Le Monde newspaper with Isabelle Ryl, 6 times a year (J. Ponce)
- Interview by Micode on Underscore_, available on YouTube, with over 310 000 views (J. Ponce)
11.3.2 Productions (articles, videos, podcasts, serious games, ...)
- Popular science video for the Algorea platform (reach: 10k teachers, 1M students) in collaboration with France-IO (S. Caron)
12 Scientific production
12.1 Major publications
- 1 (misc) FOM-Nav: Frontier-Object Maps for Object Goal Navigation. December 2025. HAL
- 2 (misc) Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation. 2025. HAL, DOI
- 3 (misc) MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos. October 2025. HAL
- 4 (misc) A Data-driven Contact Estimation Method for Wheeled-Biped Robots. January 2025. HAL
- 5 (misc) ComposeAnything: Composite Object Priors for Text-to-Image Generation. May 2025. HAL
- 6 (misc) Sobolev Diffusion Policy. July 2025. HAL
- 7 (misc) Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning. October 2025. HAL
- 8 (misc) Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models. December 2025. HAL
- 9 (misc) First-order Sobolev Reinforcement Learning. November 2025. HAL
- 10 (inproceedings) Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States. March 2025. HAL
12.2 Publications of the year
International journals
National journals
International peer-reviewed conferences
Conferences without proceedings
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications