WILLOW - 2025

2025 Activity Report, Project-Team WILLOW

RNSR:​​ 200718311C
  • Research center Inria​​​‌ Paris Centre
  • In partnership with: Ecole normale supérieure de Paris, CNRS
  • Team​​ name: Embodied computer vision​​​‌
  • In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure

Creation of the​​ Project-Team: 2021 May 01​​​‌


Keywords

Computer Science​​ and Digital Science

  • A3.1.1.​​​‌ Modeling, representation
  • A3.4. Machine‌ learning and statistics
  • A5.3.‌​‌ Image processing and analysis​​
  • A5.10. Robotics
  • A9. Artificial​​​‌ intelligence
  • A9.1. Knowledge
  • A9.2.‌ Machine learning
  • A9.3. Signal‌​‌ processing
  • A9.5. Robotics and​​ AI
  • A9.7. AI algorithmics​​​‌
  • A9.12. Computer vision

Other‌ Research Topics and Application‌​‌ Domains

  • B9.5.1. Computer science​​
  • B9.5.6. Data science

1​​​‌ Team members, visitors, external‌ collaborators

Research Scientists

  • Justin‌​‌ Carpentier [Team leader​​, INRIA, Researcher​​​‌]
  • Stephane Caron [‌INRIA, Associate Professor‌​‌ Detachement, HDR]​​
  • Shizhe Chen [INRIA​​​‌, Researcher]
  • Yann‌ Dubois De Mont-Marin [‌​‌INRIA, Starting Research​​ Position, from May​​​‌ 2025]
  • Frederike Dumbgen‌ [INRIA, Starting‌​‌ Research Position, until​​ Apr 2025]
  • Wilson​​​‌ Jallet [INRIA,‌ Starting Research Position,‌​‌ from May 2025]​​
  • Louis Montaut [INRIA​​​‌, Starting Research Position‌, from May 2025‌​‌]
  • Jean Ponce [​​ENS PARIS, Senior​​​‌ Researcher, HDR]‌
  • Cordelia Schmid [INRIA‌​‌, Senior Researcher,​​ HDR]

Post-Doctoral Fellows​​​‌

  • Ewen Dantec [ENS‌ Paris, Post-Doctoral Fellow‌​‌, until Nov 2025​​]
  • Yann Dubois De​​​‌ Mont-Marin [INRIA,‌ Post-Doctoral Fellow, until‌​‌ Apr 2025]
  • Frederike​​ Dumbgen [INRIA,​​​‌ Post-Doctoral Fellow, from‌ May 2025]
  • Wilson‌​‌ Jallet [INRIA,​​ Post-Doctoral Fellow, until​​​‌ Apr 2025]
  • Quentin‌ Le Lidec [INRIA‌​‌, Post-Doctoral Fellow,​​ until May 2025]​​​‌
  • Etienne Menager [INRIA‌, Post-Doctoral Fellow]‌​‌
  • Louis Montaut [INRIA​​, Post-Doctoral Fellow,​​​‌ until Apr 2025]‌
  • Etienne Moullet [INRIA‌​‌, Post-Doctoral Fellow,​​ until Sep 2025]​​​‌
  • Ajay Sathya [INRIA‌, Post-Doctoral Fellow]‌​‌

PhD Students

  • Roland Andrews​​ [INRIA]
  • Theo​​​‌ Bodrito [INRIA,‌ until Apr 2025]‌​‌
  • Thomas Chabal [INRIA​​, until Aug 2025​​​‌]
  • Zerui Chen [‌INRIA, until Apr‌​‌ 2025]
  • Ludovic De​​ Matteis [UNIV TOULOUSE​​​‌ III, until Apr‌ 2025]
  • Gabriel Fiastre‌​‌ [INRIA]
  • Matthieu​​ Futeral-Peter [INRIA,​​​‌ until Jun 2025]‌
  • Ricardo Garcia Pinel [‌​‌INRIA, until Jun​​ 2025]
  • Francois Garderes​​​‌ [LOUIS VUITTON,‌ CIFRE]
  • Umit Bora‌​‌ Gokbakan [INRIA]​​
  • Zeeshan Khan [INRIA​​​‌]
  • Theotime Le Hellard‌ [INRIA, from‌​‌ Oct 2025]
  • Shiyao​​​‌ Li [ENPC]​
  • Javier Alejandro Lopetegui Gonzalez​‌ [INRIA, from​​ Nov 2025]
  • Imen​​​‌ Mahdi [University of​ Freiburg]
  • Franki Nguimatsia​‌ Tiofack [INRIA]​​
  • Paul Pacaud [INRIA​​​‌]
  • Sara Pieri [​INRIA]
  • François Porcher​‌ [Meta, CIFRE]​​
  • Mathis Scheffler [INRIA​​​‌, from Oct 2025​]
  • Fabian Schramm [​‌INRIA]
  • Romain Seailles​​ [ENS Paris]​​​‌
  • Federica Spinola [INRIA​, from Oct 2025​‌]
  • Basile Terver [​​FACEBOOK, CIFRE]​​​‌
  • Valentin Tordjman–Levavasseur [INRIA​, from Oct 2025​‌]
  • Lucas Ventura [​​ENPC]
  • Elliot Vincent​​​‌ [Ministère Transition,​ until Jun 2025]​‌

Technical Staff

  • Etienne Arlaud​​ [INRIA, Engineer​​​‌]
  • Walid Bousselham [​INRIA, Engineer,​‌ from May 2025 until​​ Jun 2025]
  • Riccardo​​​‌ Cadei [INRIA,​ Engineer, from Mar​‌ 2025 until Jul 2025​​]
  • Timothee Carecchio [​​​‌INRIA, Engineer,​ from Sep 2025]​‌
  • Aamr El Kazdadi [​​INRIA, Engineer,​​​‌ from Oct 2025]​
  • Pierre Fabre [INRIA​‌, Engineer, from​​ Jun 2025]
  • Lucas​​​‌ Haubert [INRIA,​ Engineer]
  • Qikai Huang​‌ [UNIV GEORGIA,​​ Engineer, from Dec​​​‌ 2025]
  • Peteris Kulits​ [INRIA, Engineer​‌, from Jun 2025​​ until Nov 2025]​​​‌
  • Louise Manson [INRIA​, Engineer]
  • Jeanne​‌ Matheron [INRIA,​​ from Nov 2025]​​​‌
  • Megane Millan [INRIA​, Engineer, until​‌ Oct 2025]
  • Louis​​ Nel [INRIA,​​​‌ Engineer, from Dec​ 2025]
  • Valentin Tordjman–Levavasseur​‌ [INRIA, until​​ Sep 2025]
  • Joris​​​‌ Vaillant [INRIA,​ Engineer]

Interns and​‌ Apprentices

  • Joaquin Austin Ferro​​ [INRIA, Apprentice​​​‌, from Sep 2025​]
  • Timothee Carecchio [​‌INRIA, Intern,​​ from Feb 2025 until​​​‌ Aug 2025]
  • Radu​ Cristian [INRIA,​‌ Intern, from Mar​​ 2025 until Aug 2025​​​‌]
  • Theotime Le Hellard​ [INRIA, from​‌ Sep 2025 until Sep​​ 2025]
  • Theotime Le​​​‌ Hellard [ENS Paris​, Intern, from​‌ Apr 2025 until Aug​​ 2025]
  • Romain Li​​​‌ [INRIA, Intern​, from Jul 2025​‌ until Sep 2025]​​
  • Armand Modjabi [INRIA​​​‌, Intern, from​ Mar 2025 until Aug​‌ 2025]
  • Louis Nel​​ [INRIA, Intern​​​‌, from Jun 2025​ until Nov 2025]​‌
  • Matthieu Rouet [ENS​​ Paris, Intern,​​​‌ from Jun 2025 until​ Aug 2025]
  • Mathis​‌ Scheffler [INRIA,​​ Intern, from Apr​​​‌ 2025 until Sep 2025​]

Administrative Assistant

  • Marina​‌ Kovacic [INRIA]​​

Visiting Scientists

  • Qikai Huang​​​‌ [UNIV GEORGIA,​ until Nov 2025]​‌
  • Kim Jae Myung [​​UNIV TUBINGEN, until​​​‌ May 2025]
  • Victor​ Klemm [INSTITUT ETH​‌, from May 2025​​ until Jun 2025]​​​‌
  • Michael Tarr [CMU​, from Sep 2025​‌ until Oct 2025]​​
  • Baohe Zhang [University of Freiburg, from Aug 2025 until Nov 2025]

External Collaborators​​

  • Antoine Hoarau [Self-employed, from Oct 2025]
  • Theotime Le Hellard‌ [ENS de Paris‌​‌, until Apr 2025​​]

2 Overall objectives​​​‌

2.1 Statement

Building machines‌ that can automatically understand‌​‌ complex visual inputs is​​ one of the central​​​‌ scientific challenges in artificial‌ intelligence. Truly successful visual‌​‌ understanding technology will have​​ a broad impact in​​​‌ application domains as varied‌ as defense, entertainment, healthcare,‌​‌ human-computer interaction, image retrieval​​ and data mining, industrial​​​‌ and personal robotics, manufacturing,‌ scientific image analysis, surveillance‌​‌ and security, and transportation.​​

The problem is, however, very difficult due to the large variability of the visual world and the high complexity of the underlying physical phenomena. For example, people easily learn how to perform complex tasks such as changing a car tire or performing resuscitation by observing other people. This involves advanced visual perception and interaction capabilities including interpreting sequences of human actions, learning new visuomotor skills from only a few example demonstrations, grounding instructions in appropriate scene elements and actions, and applying the learned skills in new environments and situations. Currently, however, there is no artificial system with a similar level of cognitive visual competence. Our goal for the next 10 years is to develop models, methods and algorithms providing a sufficient level of visual intelligence to enable applications such as personal visual assistants or home robots that will, for example, prepare a meal in response to a chat request.

Despite the tremendous progress‌ in visual recognition in‌​‌ the last decade, current​​ visual recognition systems still​​​‌ require large amounts of‌ carefully annotated training data,‌​‌ often use black-box architectures​​ that do not model​​​‌ the 3D physical nature‌ of the visual world,‌​‌ are typically limited to​​ simple pattern recognition tasks​​​‌ such as detecting and‌ recognizing objects from a‌​‌ predefined vocabulary, and do​​ not capture real-world semantics.​​​‌ We plan to address‌ these limitations with an‌​‌ ambitious research program that​​ aims at developing models​​​‌ of the entire visual‌ understanding process from image‌​‌ acquisition to the high-level​​ embodied interpretation of visual​​​‌ scenes. We target learnable‌ models that require minimal‌​‌ to no supervision, support​​ complex reasoning about visual​​​‌ data, and are grounded‌ in interactions with the‌​‌ physical world. More concretely,​​ we will address fundamental​​​‌ scientific challenges along three‌ research axes: (i) visual‌​‌ recognition in images and​​ videos with an emphasis​​​‌ on weakly supervised learning‌ and models grounded in‌​‌ the physical 3D world;​​ (ii) learning embodied visual​​​‌ representations for robotic manipulation‌ and locomotion; and (iii)‌​‌ image restoration and enhancement.​​ These challenges will be​​​‌ tackled by a team‌ of researchers with core‌​‌ expertise in computer vision​​ and robotics, who will​​​‌ simultaneously advance both fields‌ towards convergence. The complementary‌​‌ expertise in areas such​​ as machine learning and​​​‌ natural language understanding will‌ be gained through collaboration‌​‌ with relevant research teams.​​

We believe that foundational​​​‌ research should be grounded‌ in applications and we‌​‌ plan to pursue applications​​ with high scientific, societal,​​​‌ and/or economic impact in‌ domains such as transportation;‌​‌ augmented reality; education; advanced​​​‌ manufacturing; and quantitative visual​ analysis in sciences, humanities​‌ and healthcare.

3 Research​​ program

3.1 Visual recognition​​​‌ and reconstruction of images​ and videos

It is now possible to efficiently detect individual objects and people in cluttered images and videos. Current methods, however, rely on large-scale, manually annotated image collections, often use black-box architectures that do not model the 3D physical nature of the visual world, and are typically limited to simple pattern recognition tasks. In this part of the research program, we address these fundamental limitations. In particular, we address the following three key open challenges: (i) how to leverage available but weak annotations including text, audio and speech, (ii) how to enable automatic reasoning about visual data, and (iii) how to develop models grounded in the physical 3D world including learnable models for 3D object and scene reconstruction. We also continue theoretical work aimed at understanding the geometric underpinnings of computer vision.

Our current efforts in this area are outlined in detail in Section 8.1.

3.2 Learning embodied representations​‌

Computer vision has come a long way toward understanding images and videos in terms of scene geometry, object labels, locations and poses of people or classes of human actions. This “understanding”, however, remains largely disconnected from reasoning about the physical world. For example, what will happen when removing a tablecloth from a set table? What actions will be needed to resume an interrupted meal? We believe that a true embodied understanding of dynamic scenes from visual observations is the next major research challenge. We address this challenge by developing new models and algorithms with an emphasis on the synergy between vision, learning, robotics and natural language understanding. To this end, we study learning methods for motion planning and optimal control for known environments in state space. At the same time, we develop models and algorithms for learning visuomotor policies that do not rely on the known structure of environments and instead integrate visual perception directly into control algorithms. We also address natural language as an additional modality for more efficient learning and communication with embodied agents.

Our​‌ current efforts in this​​ area are outlined in​​​‌ detail in Section 8.2​.

3.3 Image restoration​‌ and enhancement

Although image processing is a mature field, it is more important than ever with the advent of high-quality camera phones, scientific applications in microscopy and astronomy and, recently, the emergence of multi-modal sensing for autonomous cars. In addition, it is an excellent proving ground for learning-based techniques since (a) it is in general (relatively) easy to generate realistic corrupted images from clean ones, because reasonable models of the physical image corruption problem are often available (Abdelhamed et al., 2019; Nah et al., 2017), and (b) it is possible to incorporate natural image priors such as self-similarities (Buades et al., 2005) and sparsity (Mairal et al., 2009) in the modelling and optimization processes. We have conducted work on image restoration since the time of Julien Mairal's PhD thesis, addressing problems such as demosaicking, denoising, inpainting, and inverse half-toning with a combination of sparse coding/dictionary learning methods and non-local means, then moving on to blind deblurring including motion segmentation and, more recently, deep-learning methods. In our ongoing efforts we address several challenges for learning-based approaches to image restoration: (i) how to combine different modalities such as depth and RGB images to improve the quality of the joint observations; (ii) how to construct tunable, fully interpretable approaches to image restoration in a functional framework; and (iii) how to incorporate machine learning methods that go beyond the traditional fully supervised setting into the image restoration pipeline.

Our current work in‌ this area is outlined‌​‌ in detail in Section​​ 8.4.
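As an illustration of point (a) above, the following is a minimal sketch of how corrupted/clean training pairs can be synthesized from clean images. It assumes a deliberately simple degradation model (box blur followed by additive Gaussian noise) on grayscale images stored as nested lists with values in [0, 1]. The function names and parameters are hypothetical and chosen for this example only; realistic pipelines rely on calibrated camera noise and blur models rather than this toy one.

```python
import random


def box_blur(img, radius=1):
    """Blur a 2D grayscale image (list of rows) with a (2r+1)x(2r+1) box filter.

    Border pixels average only over the neighbours that fall inside the image.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def corrupt(clean, radius=1, sigma=0.05, seed=0):
    """Toy degradation model (an assumption of this sketch): blur, then noise."""
    rng = random.Random(seed)
    return add_gaussian_noise(box_blur(clean, radius), sigma, rng)


def make_training_pairs(clean_images, radius=1, sigma=0.05, seed=0):
    """Return (corrupted, clean) pairs, using a different noise seed per image."""
    return [(corrupt(img, radius, sigma, seed + i), img)
            for i, img in enumerate(clean_images)]
```

Each pair can then serve as an (input, target) example for a supervised restoration model. Replacing `corrupt` with a more faithful degradation model changes the training distribution without touching the rest of the pipeline.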

4 Application​​​‌ domains

We believe that‌ foundational modeling work should‌​‌ be grounded in applications.​​ This includes (but is​​​‌ not restricted to) the‌ following high-impact domains.

4.1‌​‌ Automated visual assistants

Modern seamless video communication has enabled new applications in education, medicine and manufacturing, such as remote surgery or remotely supervised product assembly. The abundance of online instructional videos further confirms the high demand for assistance with daily tasks such as cooking and gardening. Our work on embodied video understanding and on the joint modeling of vision and language will support automatic visual assistants. Similar to existing driving navigation assistants, such applications will guide people in daily living, inspection and manufacturing tasks. Some of these applications are studied within our MSR-Inria collaboration.

4.2 Robotics‌​‌

In 2023, the Willow team pursued the development of the Pinocchio library from both a scientific and a software perspective. Recent versions of Pinocchio now account for closed-loop mechanisms (based on proximal optimization), source code generation on GPUs, etc. All these new features make Pinocchio a unique tool to efficiently control complex robotic systems such as legged robots or industrial robots. We are now closely collaborating with PAL Robotics, which plans to use Pinocchio to control its next generation of humanoid robots, called Kangaroo. In the near future, the plan is to extend Pinocchio into a general-purpose and efficient robotic simulator handling both rigid and compliant contact interactions between a robot and its environment, with the ambition of making Pinocchio the next gold-standard framework for simulation in robotics, offering advanced features such as differentiable simulation for optimal control and reinforcement learning. Such features should position Pinocchio as the central simulator in robotics.

4.3 Image restoration‌

We are pursuing applications of our image restoration work to personal photography, to enhance the images taken by consumer cameras and smartphones by deblurring and denoising them, and improving their spatial resolution and dynamic range. In this context, we are collaborating with DXOMark, the world leader in smartphone camera evaluation, through a CIFRE thesis. Two of the objectives are to develop a public database of portraits fully compliant with European GDPR regulations, with informed consent from the models, and to automate the rating of image quality using this dataset. We also apply the mixture of physical image formation models and machine learning principles that has made our image restoration work successful to scientific fields: we collaborate with Anne-Marie Lagrange (Observatoire de Paris), Maud Langlois (CNRS/Observatoire de Lyon) and Julien Mairal (Inria) on direct exoplanet detection from ground-based telescope imagery. This work also involves a post-doc, Olivier Flasseur, and a PhD student, Théo Bodrito. Next year, we will apply the same principles to molecular microscopy, in collaboration with Jean-Baptiste Masson (Institut Pasteur).

5​ Social and environmental responsibility​‌

Artificial intelligence holds great potential for improving our environment, for example, by reducing energy consumption and optimizing energy production. Computer vision, in particular, can be used to monitor emissions from coal plants and to track forest growth using satellite imagery. Autonomous drones can monitor and prevent failures of pipelines, power lines, power plants and other remote installations. However, as larger and more powerful AI models require increased compute power at training and deployment, AI itself is responsible for an increasingly high carbon footprint. One direction of our research aims to develop efficient and low-resource neural network models. To this end, we have previously proposed Cross-Covariance Image Transformers (El-Nouby et al. NeurIPS'21) that avoid quadratic complexity in terms of image size. We have also been working on the development of new optimization methods and associated software (Bambade et al. ICLR'24) to reduce the overall computational burden and the associated energy impact in industrial and practical scenarios. In the light of these developments, with the help of the Inria Soft infrastructure, we are considering creating a new software consortium, named Maestro, to accelerate the development and the dissemination of efficient algorithmic solutions for the control of robotic systems. One objective of this consortium is to provide software solutions that reduce the computational burden and energy consumption of modern robots currently deployed in industry or in societal sectors.

6​​ Highlights of the year​​​‌

6.1 Awards

  • Cordelia Schmid​ has received the Archimedes​‌ Science Award, Dresden, 2025.​​
  • Cordelia Schmid has received​​​‌ the Hans Fischer Senior​ Fellowship, TUM, 2025.
  • Cordelia​‌ Schmid has received the​​ ACM Athena Lecturer Award,​​​‌ 2025.
  • Cordelia Schmid has been named Member of the National Academy of Artificial Intelligence (NAAI), 2025.
  • Justin Carpentier and Cordelia​ Schmid have received the​‌ Prix de thèse du​​ GdR Robotique 2025 for​​​‌ Q. Le Lidec.
  • Ajay​ Sathya has received the​‌ IEEE Robotics and Automation​​ Letters Best Paper Award​​​‌.
  • Quentin Le Lidec​ has been awarded the​‌ Best PhD Thesis Award​​ in robotics by the​​​‌ French national robotics network.​
  • Antoine Bambade has received​‌ the Prix Paul Caseau​​, awarded by the​​​‌ French Academy of Technologies​ and EDF.
  • Stéphane Caron​‌ has been awarded a​​ PIQ grant, entitled​​​‌ OSS4EAI.
  • Jean Ponce, together with Julien Mairal, has been awarded an i-Lab award for their startup Enhance Lab.

7​​​‌ Latest software developments, platforms,‌ open data

7.1 Latest‌​‌ software developments

7.1.1 alignsdf​​

  • Keywords:
    Computer vision, 3D​​​‌ reconstruction
  • Functional Description:

    This‌ is the PyTorch implementation‌​‌ of the AlignSDF research​​ paper:

    AlignSDF: Pose-Aligned Signed​​​‌ Distance Fields for Hand-Object‌ Reconstruction Zerui Chen, Yana‌​‌ Hasson, Ivan Laptev, Cordelia​​ Schmid ECCV 2022

  • Publication:​​​‌
  • Contact:
    Zerui Chen‌
  • Participant:
    4 anonymous participants‌​‌

7.1.2 BLERC

7.1.3‌​‌ BurstSR

  • Name:
    Super-resolution from​​ image bursts
  • Keyword:
    Image​​​‌ processing
  • Functional Description:
    This is a research prototype that takes as input a sequence of raw or RGB images produced by a smartphone or digital camera. The code produces a high-quality color image with higher resolution.
  • Release Contributions:‌​‌
    This new version, v0.2,​​ introduces various improvements, as​​​‌ well as C++ code‌ that accelerates the original‌​‌ Python code.
  • Publication:
  • Contact:
    Julien Mairal
  • Participant:​​​‌
    3 anonymous participants

7.1.4‌ FrozenBiLM

  • Name:
    Zero-Shot Video‌​‌ Question Answering via Frozen​​ Bidirectional Language Models
  • Keywords:​​​‌
    Computer vision, Natural language‌ processing, Deep learning
  • Functional‌​‌ Description:
    Code, datasets and​​ models associated with the​​​‌ paper "Zero-Shot Video Question‌ Answering via Frozen Bidirectional‌​‌ Language Models"
  • URL:
  • Contact:
    Antoine Yang

7.1.5​​​‌ hiveformer

  • Keywords:
    Robotics, NLP,‌ Transformer
  • Functional Description:

    This‌​‌ is the PyTorch implementation​​ of the Hiveformer research​​​‌ paper:

    Instruction-driven history-aware policies‌ for robotic manipulations Pierre-Louis‌​‌ Guhur, Shizhe Chen, Ricardo​​ Garcia, Makarand Tapaswi, Ivan​​​‌ Laptev, Cordelia Schmid CoRL‌ 2022 (oral)

  • Publication:
  • Contact:
    Pierre-Louis Guhur
  • Participant:​​
    6 anonymous participants

7.1.6​​​‌ HM3DAutoVLN

  • Name:
    Learning from‌ Unlabeled 3D Environments for‌​‌ Vision-and-Language Navigation
  • Keyword:
    Computer​​ vision
  • Functional Description:
    Open​​​‌ source release of the‌ software package for the‌​‌ ECCV'22 paper by Chen​​ et al. "Learning from​​​‌ Unlabeled 3D Environments for‌ Vision-and-Language Navigation". This release‌​‌ provides a full implementation​​ of the method, including​​​‌ code for training models,‌ and testing on standard‌​‌ datasets, generated datasets as​​ well as trained models.​​​‌
  • URL:
  • Publication:
  • Contact:
    Shizhe Chen
  • Participant:‌​‌
    5 anonymous participants

7.1.7​​ Just Ask: Learning to​​​‌ Answer Questions from Millions‌ of Narrated Videos

  • Keywords:‌​‌
    Computer vision, Natural language​​ processing, Deep learning
  • Functional​​​‌ Description:
    Code, datasets and‌ models associated with the‌​‌ paper "Just Ask: Learning​​ to Answer Questions from​​​‌ Millions of Narrated Videos"‌
  • URL:
  • Contact:
    Antoine‌​‌ Yang

7.1.8 Pinocchio

  • Name:​​
    Pinocchio
  • Keywords:
    Robotics, Biomechanics,​​​‌ Mechanical multi-body systems
  • Functional‌ Description:
    Pinocchio instantiates state-of-the-art Rigid Body Algorithms for poly-articulated systems based on revisited versions of Roy Featherstone's algorithms. In addition, Pinocchio instantiates analytical derivatives of the main Rigid-Body Algorithms like the Recursive Newton-Euler Algorithm or the Articulated-Body Algorithm. Pinocchio is first tailored for legged robotics applications, but it can be used in other contexts. It is built upon Eigen for linear algebra and FCL for collision detection. Pinocchio comes with a Python interface for fast code prototyping.
  • URL:​​​‌
  • Contact:
    Justin Carpentier‌
  • Partner:
    CNRS

7.1.9 ProxSuite‌​‌

  • Name:
    ProxSuite
  • Keywords:
    Conic​​​‌ optimization, Linear optimization, Robotics​
  • Functional Description:

    ProxSuite is​‌ a collection of open-source,​​ numerically robust, precise and​​​‌ efficient numerical solvers (e.g.,​ LPs, QPs, etc.) rooted​‌ in revisited primal-dual proximal​​ algorithms. Through ProxSuite, we​​​‌ aim to offer the​ community scalable optimizers that​‌ can deal with dense,​​ sparse or matrix-free problems.​​​‌ While the first targeted​ application is Robotics, ProxSuite​‌ can be used in​​ other contexts without limits.​​​‌

    ProxSuite is actively developed and supported by the Willow and Sierra research groups, joint research teams between Inria, École Normale Supérieure de Paris and Centre National de la Recherche Scientifique, located in France.

  • Contact:
    Justin Carpentier​
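
To give a flavour of the proximal idea mentioned in the description above, here is a minimal sketch of a plain proximal-point iteration on a small unconstrained quadratic program. It is not ProxSuite's algorithm (which is primal-dual and handles constraints) and it does not use the ProxSuite API. The function names, the 2x2 restriction and the parameter `rho` are assumptions made purely for illustration.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def proximal_point_qp(H, g, rho=1.0, iters=200, tol=1e-10):
    """Minimize 0.5 * x^T H x + g^T x for a symmetric positive semidefinite 2x2 H.

    Each step solves the regularized subproblem
        x_{k+1} = argmin_x  0.5 * x^T H x + g^T x + (rho / 2) * ||x - x_k||^2,
    whose optimality condition is (H + rho * I) x_{k+1} = rho * x_k - g.
    """
    x = [0.0, 0.0]
    # H + rho * I is invertible for rho > 0 even when H itself is singular.
    A = [[H[0][0] + rho, H[0][1]], [H[1][0], H[1][1] + rho]]
    for _ in range(iters):
        rhs = [rho * x[0] - g[0], rho * x[1] - g[1]]
        x_new = solve_2x2(A, rhs)
        step = max(abs(x_new[0] - x[0]), abs(x_new[1] - x[1]))
        x = x_new
        if step < tol:
            break
    return x


# Example: H = diag(2, 4), g = (-2, -8); the minimizer satisfies H x = -g.
solution = proximal_point_qp([[2.0, 0.0], [0.0, 4.0]], [-2.0, -8.0])
```

The regularization term keeps every linear system well conditioned, which is one intuition behind the numerical robustness of proximal methods on degenerate problems.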

7.1.10 SPE

  • Name:
    Semantics​‌ Preserving Encoder
  • Keywords:
    NLP,​​ Adversarial attack, Word embeddings​​​‌
  • Functional Description:
    Semantics Preserving​ Encoder is a simple,​‌ fully supervised sentence embedding​​ technique for textual adversarial​​​‌ attacks.
  • URL:
  • Contact:​
    Hugo Cisneros
  • Participant:
    3​‌ anonymous participants

7.1.11 TubeDETR​​

  • Name:
    TubeDETR: Spatio-Temporal Video​​​‌ Grounding with Transformers
  • Keywords:​
    Computer vision, Natural language​‌ processing, Deep learning
  • Functional​​ Description:
    Code, datasets and​​​‌ models associated with the​ paper "TubeDETR: Spatio-Temporal Video​‌ Grounding with Transformers"
  • URL:​​
  • Contact:
    Antoine Yang​​​‌

7.1.12 vil3dref

  • Name:
    Language​ Conditioned Spatial Relation Reasoning​‌ for 3D Object Grounding​​
  • Keyword:
    Computer vision
  • Functional​​​‌ Description:
    Open source release​ of the software package​‌ for the NeurIPS'22 paper​​ by Chen et al.​​​‌ "Language Conditioned Spatial Relation​ Reasoning for 3D Object​‌ Grounding". This release provides​​ a full implementation of​​​‌ the method, including code​ for training models, and​‌ testing on standard datasets,​​ as well as trained​​​‌ models.
  • URL:
  • Publication:​
  • Contact:
    Shizhe Chen​‌
  • Participant:
    5 anonymous participants​​

7.1.13 VLN-DUET

  • Name:
    Think​​​‌ Global, Act Local: Dual-scale​ Graph Transformer for Vision-and-Language​‌ Navigation
  • Keyword:
    Computer vision​​
  • Functional Description:
    Open source release of the software package for the CVPR'22 paper by Chen et al. "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation". This release provides a full implementation of the method, including code for training models, and testing on standard datasets, as well as trained models.
  • URL:​​​‌
  • Publication:
  • Contact:​
    Shizhe Chen
  • Participant:
    5​‌ anonymous participants

Participants: Jean​​ Ponce, Justin Carpentier​​​‌, Cordelia Schmid,​ Ivan Laptev, Etienne​‌ Arlaud, Pierre-Guillaume Raverdy​​, Stephane Caron,​​​‌ Shizhe Chen.

7.2​ New platforms

Together with SED, we are building the new robotics laboratory at Inria Paris, located on the 1st floor of the A building. The lab hosts a diverse set of robotic platforms covering dexterous manipulation, legged locomotion, and mobile robotics. The current equipment includes three UR5 robotic arms, an Allegro Hand, a Shadow Hand, and a TIAGo++ robot integrating both a mobile base and a manipulator. For legged and mobile experiments, the lab includes the Upkie biped, the Unitree GO2 quadruped, and the ODRI Solo-12 quadruped. In 2025, the laboratory expanded its fleet with the acquisition of two Unitree G1 humanoid robots. The robotics laboratory is also equipped with a dedicated Motion Capture system for precise object localization and robot calibration. These robotic platforms will enable our future research and experiments with locomotion, navigation and manipulation.

7.3 Open data

8​​ New results

8.1 Visual​​​‌ recognition and reconstruction of‌ images and videos

8.1.1‌​‌ MaskCaptioner: Learning to Jointly​​ Segment and Caption Object​​​‌ Trajectories in Videos

Participants:‌ Gabriel Fiastre, Antoine‌​‌ Yang, Cordelia Schmid​​.

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose in this work [44] to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available here.

Figure 1: Example of synthetic captions in our LV-VISCap dataset.

8.1.2​​ ComposeAnything: Composite Object Priors​​​‌ for Text-to-Image Generation

Participants:‌ Zeeshan Khan, Shizhe‌​‌ Chen, Cordelia Schmid​​.

This paper [47] addresses the problem of compositional text-to-image generation. Current text-to-image models struggle to generate scenes with many objects and complex relations. Training-time solutions such as layout conditioning or reinforcement learning improve compositional accuracy but often degrade image quality and realism by enforcing rigid constraints. To address this limitation, we introduce ComposeAnything, an inference-only framework that injects a structured composite object prior directly into the diffusion process. Rather than starting from random latent noise or performing expensive noise optimization, we construct a single 2.5D composite prior encoding strong object appearance, counts, sizes, and coarse depth-aware placement, and use it to initialize and guide one diffusion trajectory. This explicit prior is interpretable and editable in image space, enabling human-in-the-loop refinement by simply adjusting the composite. Our training-free, backbone-agnostic method improves compositional consistency on the T2I-CompBench and NSR-1K benchmarks, particularly for complex prompts, while maintaining high visual quality compared to both training-based baselines and other inference-time methods.

Figure 2: ComposeAnything enables text-to-image generation for complex compositions involving surreal spatial relationships and high object counts, achieving both high visual quality and strong faithfulness to text.

8.1.3 FACap: A​​​‌ Large-Scale Fashion Dataset for‌ Fine-grained Composed Image Retrieval‌​‌

Participants: François Gardères,​​ Camille-Sovanneary Gauthier, Shizhe​​​‌ Chen, Jean Ponce‌.

The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the high cost of manual annotation by specialists. To address these challenges, we introduce FACap 33, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. We then propose a new CIR model, FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and on enhFashionIQ, an enhanced evaluation dataset whose higher-quality annotations are obtained with our pipeline. Experimental results show that the combination of FashionBLIP-2 and pretraining on FACap significantly improves performance in fashion CIR, especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites.

Figure 3:​​ Our automatically constructed FACap​​​‌ dataset offers more detailed​ and accurate annotations than​‌ existing datasets for the​​ fashion CIR task.

8.1.4​​​‌ Chapter-Llama: Efficient Chaptering in​ Hour-Long Videos with LLMs​‌

Participants: Lucas Ventura,​​ Antoine Yang, Cordelia​​​‌ Schmid, Gül Varol​.

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper 31, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate its advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.

8.1.5 Online 3D Scene​‌ Reconstruction Using Neural Object​​ Priors

Participants: Thomas Chabal​​, Shizhe Chen,​​​‌ Jean Ponce, Cordelia‌ Schmid.

This paper‌​‌ 23 addresses the problem​​ of reconstructing a scene​​​‌ online at the level‌ of objects given an‌​‌ RGB-D video sequence. While​​ current object-aware neural implicit​​​‌ representations hold promise, they‌ are limited in online‌​‌ reconstruction efficiency and shape​​ completion. Our main contributions​​​‌ to alleviate the above‌ limitations are twofold. First,‌​‌ we propose a feature​​ grid interpolation mechanism to​​​‌ continuously update grid-based object-centric‌ neural implicit representations as‌​‌ new object parts are​​ revealed. Second, we construct​​​‌ an object library with‌ previously mapped objects in‌​‌ advance and leverage the​​ corresponding shape priors to​​​‌ initialize geometric object models‌ in new videos, subsequently‌​‌ completing them with novel​​ views as well as​​​‌ synthesized past views to‌ avoid losing original object‌​‌ details. Extensive experiments on​​ synthetic environments from the​​​‌ Replica dataset, real-world ScanNet‌ sequences and videos captured‌​‌ in our laboratory demonstrate​​ that our approach outperforms​​​‌ state-of-the-art neural implicit models‌ for this task in‌​‌ terms of reconstruction accuracy​​ and completeness.

Figure 4: Our method reconstructs scenes at the level of objects from RGB-D videos on the fly. We leverage 3D shape priors from a pre-computed object library to enhance accuracy and completeness of geometry reconstruction for individual objects.

8.1.6 Detecting Looted​​ Archaeological Sites from Satellite​​​‌ Image Time Series

Participants:‌ Elliot Vincent, Mehraïl‌​‌ Saroufim, Jonathan Chemla​​, Yves Ubelmann,​​​‌ Philippe Marquis, Jean‌ Ponce, Mathieu Aubry‌​‌.

Archaeological sites are the physical remains of past human activity and one of the main sources of information about past societies and cultures. However, they are also the target of malevolent human actions, especially in countries that have experienced internal turmoil and conflicts. Because monitoring these sites from space is a key step towards their preservation, we introduce the DAFA Looted Sites dataset 32, a labeled multi-temporal remote sensing dataset containing 55,480 images acquired monthly over 8 years across 675 Afghan archaeological sites, including 135 sites looted during the acquisition period. It is particularly challenging because of the limited number of training samples, the class imbalance, the weak binary annotations only available at the level of the time series, and the subtlety of relevant changes coupled with significant irrelevant ones over a long time period. It is also an interesting testbed for assessing the performance of satellite image time series (SITS) classification methods on a real and important use case. We evaluate a large set of baselines, highlight the substantial benefits of using foundation models and show the additional boost provided by using complete time series instead of a single image.

8.1.7 Towards Zero-Shot Multimodal​​​‌ Machine Translation

Participants: Matthieu‌ Futeral, Cordelia Schmid‌​‌, Benoît Sagot,​​​‌ Rachel Bawden.

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work 24, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To show that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible.
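The inference-time trade-off described above can be illustrated with a minimal sketch of classifier-free guidance applied to next-token distributions. This is not the ZeroMMT implementation: the function names, the choice of contrasting an image-conditioned pass with a text-only pass, and the `guidance_scale` parameter are assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1D vector of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def guided_log_probs(logits_with_image, logits_text_only, guidance_scale):
    """Combine two next-token distributions with classifier-free guidance.

    Hypothetical helper: a scale of 0 recovers the text-only distribution,
    a scale of 1 recovers the image-conditioned one, and larger values
    amplify whatever the image changes in the prediction.
    """
    lp_img = log_softmax(np.asarray(logits_with_image, dtype=float))
    lp_txt = log_softmax(np.asarray(logits_text_only, dtype=float))
    mixed = lp_txt + guidance_scale * (lp_img - lp_txt)
    return log_softmax(mixed)  # renormalise so the probabilities sum to 1
```

In this sketch, a larger scale would favor image-driven disambiguation, while a smaller one would stay closer to the text-only translation.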

8.1.8 mOSCAR: A​​​‌ Large-scale Multilingual and Multimodal​ Document-level Corpus

Participants: Matthieu​‌ Futeral, Armel Zebaze​​, Pedro Ortiz Suarez​​​‌, Julien Abadji,​ Rémi Lacroix, Cordelia​‌ Schmid, Rachel Bawden​​, Benoît Sagot.​​​‌

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either caption-like only, medium-scale, or fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR 25, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual models to demonstrate the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.

Figure 5: Example of a French document from mOSCAR.

8.1.9 Dual Perspectives on‌ Non-Contrastive Self-Supervised Learning

Participants:‌​‌ Jean Ponce, Basile​​ Terver, Martial Hebert​​​‌, Michael Arbel.‌

The stop gradient and‌​‌ exponential moving average iterative​​ procedures are commonly used​​​‌ in non-contrastive approaches to‌ self-supervised learning to avoid‌​‌ representation collapse, with excellent​​ performance in downstream applications​​​‌ in practice. This presentation‌ investigates these procedures from‌​‌ the dual viewpoints of​​ optimization and dynamical systems.​​​‌ We show that, in‌ general, although they do‌​‌ not optimize the original​​ objective, or any other​​​‌ smooth function, they do‌ avoid collapse. Following Tian‌​‌ et al. 2021, but​​ without any of the​​​‌ extra assumptions used in‌ their proofs, we then‌​‌ show using a dynamical​​ system perspective that, in​​​‌ the linear case, minimizing‌ the original objective function‌​‌ without the use of​​ a stop gradient or​​​‌ exponential moving average always‌ leads to collapse. Conversely,‌​‌ we characterize explicitly the​​ equilibria of the dynamical​​​‌ systems associated with these‌ two procedures in this‌​‌ linear setting as algebraic​​ varieties in their parameter​​​‌ space, and show that‌ they are, in general,‌​‌ asymptotically stable. Our​​ theoretical findings 53 are​​​‌ illustrated by empirical experiments‌ with real and synthetic‌​‌ data.

8.1.10 Optimal transport​​ unlocks end-to-end learning for​​​‌ single-molecule localization

Participants: Romain‌ Seailles, Jean-Baptiste Masson‌​‌, Jean Ponce,​​ Julien Mairal.

Single-molecule localization microscopy (SMLM) allows reconstructing biologically relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation 36, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available online.
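To make the set-matching idea concrete, here is a generic entropic optimal-transport sketch between predicted and ground-truth emitter positions, computed with Sinkhorn iterations. It is not the loss derived in the paper: the uniform marginals, the squared Euclidean cost, the regularization `epsilon` and all function names are assumptions chosen for illustration. The point is only that such a loss compares two unordered point sets through smooth operations, with no NMS step.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, epsilon=0.05, n_iters=200):
    """Entropic OT plan between marginals a and b for a given cost matrix."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows towards marginal a
        v = b / (kernel.T @ u)    # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]


def ot_matching_loss(pred, target, epsilon=0.05, n_iters=200):
    """Transport cost between two point sets (hypothetical helper)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise squared distances between predictions and ground truth.
    cost = np.sum((pred[:, None, :] - target[None, :, :]) ** 2, axis=-1)
    a = np.full(len(pred), 1.0 / len(pred))      # uniform mass (assumption)
    b = np.full(len(target), 1.0 / len(target))
    plan = sinkhorn_plan(cost, a, b, epsilon, n_iters)
    return float(np.sum(plan * cost)), plan
```

Because the loss depends on the sets only through the cost matrix, the order in which predictions are listed does not matter. Coordinates are assumed to be normalized so that `cost / epsilon` stays in a numerically safe range.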

8.2‌ Learning embodied representations

8.2.1‌​‌ NextBestPath: Efficient 3D Mapping​​ of Unseen Environments

Participants:​​​‌ Shiyao Li, Antoine‌ Guédon, Clémentin Boittiaux‌​‌, Shizhe Chen,​​ Vincent Lepetit.

This paper 27 addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset, AiMDoom, with a map generator for the Doom video game, enabling better benchmarking of active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.

Figure 6: Overview​ of the proposed next-best-path​‌ (NBP) framework. The model​​ (left) predicts a value​​​‌ map of coverage gain​ and an obstacle map,​‌ which are used for​​ decision making (right) to​​​‌ obtain a next-best path.​

8.2.2 FOM-Nav: Frontier-Object Maps​‌ for Object Goal Navigation​​

Participants: Thomas Chabal,​​​‌ Shizhe Chen, Jean​ Ponce, Cordelia Schmid​‌.

This paper 42 addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation efficiency metric SPL, and yields promising results on a real robot.
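For readers unfamiliar with the SPL metric mentioned above, the sketch below computes it under its usual definition in the embodied navigation literature (Success weighted by Path Length): each successful episode contributes the ratio of the shortest-path length to the length actually travelled, and failures contribute zero. The function and argument names are illustrative and are not taken from the FOM-Nav evaluation code.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes."""
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have one entry per episode")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, travelled in zip(
        successes, shortest_lengths, path_lengths
    ):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A path can never be credited as shorter than the optimum.
            total += shortest / max(travelled, shortest)
    return total / len(successes)
```

For example, with three episodes of optimal length 10, where the agent succeeds with paths of length 10 and 20 and fails the third, the per-episode scores are 1, 0.5 and 0, so the SPL is 0.5.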

Figure 7: The​‌ proposed frontier-object map is​​ a rich representation of​​​‌ objects and frontiers (boundaries​ of the explored scene),​‌ displayed here as colored​​ point clouds and red​​​‌ lines. It encodes geometric,​ distance and visual/textual information​‌ for frontiers and objects.​​

8.2.3 Gondola: Grounded Vision​​​‌ Language Planning for Generalizable​ Robotic Manipulation

Participants: Shizhe​‌ Chen, Ricardo Garcia​​, Paul Pacaud,​​​‌ Cordelia Schmid.

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instruction tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work 43, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and the history of plans as input and produces the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.

Figure 8:‌ Gondola leverages multi-view images‌​‌ for 3D scene perception​​ and segmentation masks to​​​‌ provide precisely grounded plans.‌

8.2.4 Collision avoidance from‌​‌ monocular vision trained with​​ novel view synthesis

Participants:​​​‌ Valentin Tordjman--Levavasseur, Stéphane‌ Caron.

Collision avoidance can be checked in explicit environment models such as elevation maps or occupancy grids, yet integrating such models with a locomotion policy requires accurate state estimation. In 58, we consider the question of collision avoidance from an implicit environment model. We use monocular RGB images as inputs and train a collision-avoidance policy from photorealistic images generated by 2D Gaussian splatting. We evaluate the resulting pipeline in real-world experiments under velocity commands that put the robot on an intercept course with obstacles. Our results suggest that RGB images can be enough to make collision-avoidance decisions, both in the room where training data was collected and in out-of-distribution environments.

Figure 9: Effect​​​‌ of the vision-based collision-avoidance‌ policy when the commanded‌​‌ velocity prompts the robot​​ to collide with a​​​‌ wall. Blue: joystick user‌ input, kept stationary at‌​‌ full forward throttle. Green:​​ trajectory actually followed by​​​‌ the robot after compensation‌ by the policy.

8.2.5‌​‌ KernelSOS for Global Sampling-Based​​ Optimal Control and Estimation​​​‌ via Semidefinite Programming

Participants:‌ Antoine Groudiev, Fabian‌​‌ Schramm, Eloïse Berthier​​, Justin Carpentier,​​​‌ Frederike Dümbgen.

Global optimization has gained traction over the past decades, thanks to the development of both theoretical foundations and efficient numerical routines to cope with optimization problems of various complexities. Among recent methods, Kernel Sum of Squares (KernelSOS) appears as a powerful framework, combining the potential of sum of squares methods from the polynomial optimization community with the expressivity of kernel methods widely used in machine learning. This paper 46 applies the kernel sum of squares framework to control and estimation problems, which exhibit poor local minima. We demonstrate that KernelSOS performs well on a selection of problems from both domains. In particular, we show that KernelSOS is competitive with other sum of squares approaches on estimation problems, while being applicable to non-polynomial and non-parametric formulations. The sample-based nature of KernelSOS allows us to apply it to trajectory optimization problems with an integrated simulator treated as a black box, both as a standalone method and as a powerful initialization method for local solvers, facilitating the discovery of better solutions.

8.2.6 Sobolev Diffusion Policy​

Participants: Theotime Le Hellard​‌, Franki Nguimatsia Tiofack​​, Quentin Le Lidec​​​‌, Justin Carpentier.​

This paper 48 introduces Sobolev Diffusion Policy (SDP), a novel framework that effectively combines the strengths of policy learning and trajectory optimization. On the one hand, it builds upon diffusion policy, an expressive imitation learning method based on diffusion probabilistic generative models. On the other hand, it uses gradient-based trajectory optimization solvers to generate locally optimal trajectories and leverages their associated feedback gains, performing Sobolev training with first-order information. Combining both, we introduce a first-order loss for diffusion-based policies. The framework alternates between collecting trajectories using a solver warm-started by the policy and training. Through comprehensive experiments, we demonstrate how the Sobolev component significantly reduces the number of trajectories required for the policy to converge globally. First-order information both avoids overfitting, despite using very few samples, and mitigates the compounding error issue of imitation-based policies, even when predicting torques for tasks requiring high-frequency control. We benchmark the benefits of SDP on various robotics tasks of increasing complexity. In particular, SDP proves stable over extended horizons with fewer diffusion steps, shrinking the overall rollout time compared to vanilla diffusion models. When used to compute initial guesses for trajectory optimization, it reduces the solving time by a factor of 2 to 20.

Figure 10: A task that involves moving the arm's end-effector from the blue sphere to the red one. The pink trajectory is obtained by a trajectory optimization (TO) solver alone, the orange one by our SDP method, and the gray one is the SDP trajectory refined by the solver. SDP finds more direct trajectories, while the TO solver may get stuck in local minima.

8.2.7 First-order​​ Sobolev Reinforcement Learning

Participants:​​​‌ Fabian Schramm, Nicolas​ Perrin-Gilbert, Justin Carpentier​‌.

This paper 55​​ proposes a refinement of​​ temporal-difference learning that enforces​​​‌ first-order Bellman consistency: the‌ learned value function is‌​‌ trained to match not​​ only the Bellman targets​​​‌ in value but also‌ their derivatives with respect‌​‌ to states and actions.​​ By differentiating the Bellman​​​‌ backup through differentiable dynamics,‌ we obtain analytically consistent‌​‌ gradient targets. Incorporating these​​ into the critic objective​​​‌ using a Sobolev-type loss‌ encourages the critic to‌​‌ align with both the​​ value and local geometry​​​‌ of the target function.‌ This first-order TD matching‌​‌ principle can be seamlessly​​ integrated into existing algorithms,​​​‌ such as Q-learning or‌ actor-critic methods (e.g., DDPG,‌​‌ SAC), potentially leading to​​ faster critic convergence and​​​‌ more stable policy gradients‌ without altering their overall‌​‌ structure.
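One way to write such a Sobolev-type critic objective is sketched below. The notation is assumed here for illustration and is not taken from the paper: $Q_\theta$ is the critic, $\bar{Q}$ a target critic, $f$ the differentiable dynamics, $r$ the reward, $\pi$ the policy, $\gamma$ the discount factor, and $\lambda_s, \lambda_a \geq 0$ are weights on the derivative terms.

```latex
% Bellman target, differentiable in (s, a) through the dynamics f
y(s, a) = r(s, a) + \gamma \, \bar{Q}\big(s', \pi(s')\big),
\qquad s' = f(s, a)

% Sobolev-type critic loss: match the target in value and in first derivatives
\mathcal{L}(\theta) = \mathbb{E}_{(s, a)} \Big[
    \big( Q_\theta(s, a) - y(s, a) \big)^2
  + \lambda_s \big\| \nabla_s Q_\theta(s, a) - \nabla_s y(s, a) \big\|^2
  + \lambda_a \big\| \nabla_a Q_\theta(s, a) - \nabla_a y(s, a) \big\|^2
\Big]
```

Under these assumptions, the gradient targets $\nabla_s y$ and $\nabla_a y$ follow from the chain rule applied through $r$, $f$ and $\bar{Q}$, and setting $\lambda_s = \lambda_a = 0$ recovers the usual squared temporal-difference loss.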

Figure 11: Comparison of Q-function slices: ground truth (black), Sobolev Q-learning (blue), and default Q-learning (dashed red) after 200 and 400 training steps for two different states, $s = 0.0$ and $s = 0.5$.

8.2.8‌​‌ Control of Humanoid Robots​​ with Parallel Mechanisms using​​​‌ Differential Actuation Models

Participants:‌ Victor Lutz, Ludovic‌​‌ de Matteis, Virgile​​ Batto, Nicolas Mansard​​​‌.

Several recently released‌ humanoid robots, inspired by‌​‌ the mechanical design of​​ Cassie, employ actuator configurations​​​‌ in which the motors‌ are displaced from the‌​‌ joints to reduce leg​​ inertia. While studies accounting​​​‌ for the full kinematic‌ complexity have demonstrated the‌​‌ benefits of these designs,​​ the associated loop-closure constraints​​​‌ greatly increase computational cost‌ and limit their use‌​‌ in control and learning.​​ As a result, the​​​‌ non-linear transmission is often‌ approximated by a constant‌​‌ reduction ratio, preventing exploitation​​ of the mechanism’s full​​​‌ capabilities. This paper 50‌ introduces a compact analytical‌​‌ formulation for the two​​ standard knee and ankle​​​‌ mechanisms that captures the‌ exact non-linear transmission while‌​‌ remaining computationally efficient. The​​ model is fully differentiable​​​‌ up to second order‌ with a minimal formulation,‌​‌ enabling low-cost evaluation of​​ dynamic derivatives for trajectory​​​‌ optimization and of the‌ apparent transmission impedance for‌​‌ reinforcement learning. We integrate​​ this formulation into trajectory​​​‌ optimization and locomotion policy‌ learning, and compare it‌​‌ against simplified constant-ratio approaches.​​ Hardware experiments demonstrate improved​​​‌ accuracy and robustness, showing‌ that the proposed method‌​‌ provides a practical means​​ to incorporate parallel actuation​​​‌ into modern control algorithms.‌

8.2.9 On the Conic‌​‌ Complementarity of Planar Contacts​​

Participants: Yann de Mont-Marin​​​‌, Louis Montaut,‌ Jean Ponce, Martial‌​‌ Hebert, Justin Carpentier​​.

We present a unifying theoretical result 30 that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept used in, for instance, optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the punctual Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes (sticking, separating, and tilting) within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of advanced control and optimization algorithms in locomotion and manipulation.

8.2.10 Reference-Free Sampling-Based​‌ Model Predictive Control

Participants:​​ Fabian Schramm, Pierre​​​‌ Fabre, Nicolas Perrin-Gilbert​, Justin Carpentier.​‌

This paper 35 presents a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, including trotting, galloping, robust standing, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a dual-space spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency allows us to achieve real-time control on standard CPU hardware, eliminating the need for GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating various emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a humanoid, all without requiring reference tracking or offline pre-training.
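As background for the MPPI building block mentioned above, the following is a minimal sketch of a vanilla MPPI update on a toy 1D double integrator. It samples noisy control sequences around a nominal one, rolls each out, and averages them with softmin weights on their costs. It does not reproduce the paper's dual-space spline parameterization or its robot models: the dynamics, the cost terms, and all names and default values are assumptions made for illustration.

```python
import numpy as np


def rollout_cost(x0, controls, dt=0.05, target=1.0):
    """Cost of driving a toy 1D double integrator towards `target`."""
    pos, vel = x0
    cost = 0.0
    for u in controls:
        vel += dt * u   # the control is an acceleration command
        pos += dt * vel
        cost += (pos - target) ** 2 + 0.1 * vel ** 2 + 1e-3 * u ** 2
    return cost


def mppi_step(x0, nominal, rng, n_samples=64, sigma=1.0, temperature=1.0):
    """One MPPI update: sample, roll out, and average with softmin weights."""
    noise = rng.normal(0.0, sigma, size=(n_samples, nominal.shape[0]))
    candidates = nominal[None, :] + noise
    costs = np.array([rollout_cost(x0, c) for c in candidates])
    # Subtracting the minimum keeps the exponentials numerically stable.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return weights @ candidates


def plan(x0, horizon=20, n_iters=30, seed=0):
    """Refine an all-zero control sequence with repeated MPPI updates."""
    rng = np.random.default_rng(seed)
    nominal = np.zeros(horizon)
    for _ in range(n_iters):
        nominal = mppi_step(x0, nominal, rng)
    return nominal
```

No reference trajectory appears anywhere in this sketch: the behavior comes only from the cost. In a receding-horizon loop, one would typically apply the first control, shift the sequence by one step, and repeat.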

Figure​‌ 12: Overview of​​ the framework showing the​​​‌ dual-spline parametrization, noise schedule​ and reference-free costs.
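As a rough illustration of the sampling-based update that MPPI-style controllers build on, the sketch below computes a softmin-weighted average of sampled control sequences. It is a generic textbook-style update, not the dual-space spline parameterization of 35; the function name `mppi_update`, the `temperature` parameter, and the toy inputs are assumptions made for this example.

```python
import math


def mppi_update(samples, costs, temperature=1.0):
    """Return the softmin-weighted average of sampled control sequences.

    samples: list of control sequences (each a list of floats, same length).
    costs: one scalar rollout cost per sample (lower is better).
    temperature: hypothetical tuning parameter; smaller values favor the
        lowest-cost samples more strongly.
    """
    best = min(costs)  # subtract the best cost for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    horizon = len(samples[0])
    return [
        sum(w * s[t] for w, s in zip(weights, samples)) / total
        for t in range(horizon)
    ]
```

With equal costs the update reduces to the plain mean of the samples, and when one sample is much cheaper than the others the update collapses onto that sample.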

8.2.11​‌ Guided Flow Policy: Learning​​ from High-Value Actions in​​​‌ Offline Reinforcement Learning

Participants:​ Franki Nguimatsia Tiofack,​‌ Théotime Le Hellard,​​ Fabian Schramm, Nicolas​​​‌ Perrin-Gilbert, Justin Carpentier​.

Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP) 34, which couples a multistep flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state-based and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.

Figure 13: Overview of the Guided Flow Policy framework. GFP consists of three main components: (i) in yellow, VaBC, a multi-step flow policy $\pi_\omega$ trained via weighted BC using the guidance term $g_\eta$, (ii) in green, a one-step actor $\pi_\theta$ distilled from the flow policy, and (iii) in gray, a critic $Q_\phi$ guiding action evaluation. $\pi_\omega$ regularizes the actor toward high-value actions from the dataset; in turn, the actor shapes the flow and optimizes the critic following the actor-critic approach. The different components of the figure are introduced throughout the paper. Each drawing represents the probability distribution of actions $a \in \mathcal{A}$ of a policy in the current state $s$, except for the gray ones, which show the value of actions $a \in \mathcal{A}$ in state $s$ according to the critic.

8.2.12 Contact-Implicit‌​‌ Inverse Dynamics

Participants: Etienne Ménager, Pierre Fabre, Antoine Bambade, Wilson Jallet, Alberto De Marchi, Justin Carpentier.

Task-space inverse​​​‌ dynamics, also known as‌ operational space control, is‌​‌ a popular control paradigm​​ for controlling robots in​​​‌ real-time. It enables the‌ control or stabilization of‌​‌ robot dynamics around reference​​ trajectories while accounting for​​​‌ under-actuation, actuator limits, and‌ contact interactions. Over the‌​‌ past few decades, this​​ versatile control paradigm has​​​‌ been successfully deployed in‌ numerous robotics settings, ranging‌​‌ from quadrupeds and humanoid​​ robots to deformable robots,​​​‌ in scenarios involving rich‌ physical contact interactions between‌​‌ a robot and its​​ environment. In practice, contact-aware​​​‌ inverse dynamics controllers assume‌ that contact sequences are‌​‌ known in advance, typically​​ provided by a higher-level​​​‌ contact planner, which inherently‌ limits their ability to‌​‌ select among breaking, sliding,​​ or sticking contacts automatically.​​​‌

In this paper 51, we extend the control formalism of task-space inverse dynamics, which is classically formulated as a quadratic program, to a more general quadratic program with complementarity constraints (QPCC). This formulation fully accounts for actuator limits and frictional contacts, modeled as nonlinear complementarity constraints. To solve these QPCC problems, we draw inspiration from the alternating direction method of multipliers to devise an iterative optimization approach that alternates between minimizing a smooth convex function that accounts for task objectives and system dynamics, and projecting onto convex and non-convex sets that capture actuator and frictional contact complementarity constraints. By handling frictional contact complementarity constraints through projection, our approach implicitly and automatically reasons about the optimal contact modes that fulfill the task objectives and constraints. We have implemented our QPCC solver in C++ for efficiency, and demonstrate its usability and versatility on rigid and soft robots across various control scenarios, ranging from the control of an actuated box sliding on the ground, to the balance control of legged robots that automatically break and create contacts (e.g., jumping tasks, balancing tasks), to the control of deformable robots interacting with their environment.

Figure 14: The proposed method generically handles inverse dynamics with frictional contact for both rigid and soft robots. Standing humanoid (left). The humanoid tracks an unstable reference configuration (standing on its left leg). The only cost terms are the reference tracking cost and regulating the angular momentum around zero. Deformable robot (right). The robot controls a ball (in blue) by deforming its body. The task is formulated as minimizing the distance between a reference position and the ball's center, and solved using the robot-ball coupling via frictional contacts.
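One elementary building block of projection-based schemes of this kind is the Euclidean projection of a contact force onto a Coulomb friction cone. The sketch below shows only that convex projection for a single contact with a two-dimensional tangential force; it is not the QPCC solver of 51, which also projects onto non-convex complementarity sets. The function name `project_friction_cone` and its argument layout are assumptions made for this example.

```python
import math


def project_friction_cone(force_t, force_n, mu):
    """Project (force_t, force_n) onto {(f_t, f_n): ||f_t|| <= mu * f_n}.

    force_t: tangential force components (two floats).
    force_n: normal force (float).
    mu: friction coefficient (assumed strictly positive).
    """
    norm_t = math.hypot(*force_t)
    if norm_t <= mu * force_n:
        # Already inside the cone: nothing to do.
        return list(force_t), force_n
    if mu * norm_t <= -force_n:
        # Inside the polar cone: the closest point is the apex.
        return [0.0 for _ in force_t], 0.0
    # Otherwise, project onto the boundary ray of the cone.
    scale_n = (mu * norm_t + force_n) / (1.0 + mu * mu)
    scale_t = mu * scale_n / norm_t
    return [scale_t * c for c in force_t], scale_n
```

The three branches correspond to a force that already satisfies the friction law, a force that is mapped to zero, and a force that is mapped to the cone boundary.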

8.3 A Data-driven Contact​​​‌ Estimation Method for Wheeled-Biped​ Robots

Participants: Ü. Bora​‌ Gökbakan, Frederike Dümbgen​​, Stéphane Caron.​​​‌

Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter 45 in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.

Figure 15:​​​‌ Robustly detecting the moments​ when a wheeled-biped robot​‌ makes and breaks contact​​ is crucial for successful​​​‌ estimation and control. This​ paper proposes a contact​‌ estimator based only on​​ inertial and torque measurements.​​​‌ The measurements are fed​ into a novel Bayesian​‌ filter formulation to robustly​​ estimate the binary contact​​​‌ state. We validate our​ results extensively both in​‌ simulation and real-world experiments,​​ as depicted in the​​​‌ bottom figure.

8.3.1 End-to-End​ and Highly-Efficient Differentiable Simulation​‌ for Robotics

Participants: Quentin​​ Le Lidec, Louis​​​‌ Montaut, Yann de​ Mont-Marin, Justin Carpentier​‌.

Over the past few years, robotics simulators have greatly improved in efficiency and scalability, enabling them to generate years of simulated data in a few hours. Yet, efficiently and accurately computing the simulation derivatives remains an open challenge, with potentially high gains in the convergence speed of reinforcement learning and trajectory optimization algorithms, especially for problems involving physical contact interactions. This paper 49 contributes to this objective by introducing a unified and efficient algorithmic solution for computing the analytical derivatives of robotic simulators. The approach considers both the collision and frictional stages, accounting for their intrinsic nonsmoothness and also exploiting the sparsity induced by the underlying multibody systems. These derivatives have been implemented in C++, and the code will be open-sourced in the Simple simulator. They exhibit state-of-the-art timings ranging from 5 μs for a 7-dof manipulator up to 95 μs for a 36-dof humanoid, outperforming alternative solutions by a factor of at least 100.

Figure 16: Illustration of the sliding mode. $\lambda^*$ lies on the boundary of the cone $K_\mu$ in the direction opposite to $\sigma = \sigma_T$, and the variation $\mathrm{d}\lambda^*$ lies inside the tangent plane.

8.3.2​​ Differentiable Simulation of Soft​​​‌ Robots with Frictional Contacts‌

Participants: Etienne Ménager,‌​‌ Louis Montaut, Quentin​​ Le Lidec, Justin​​​‌ Carpentier.

In recent‌ years, soft robotics simulators‌​‌ have evolved to offer​​ various functionalities, including the​​​‌ simulation of different material‌ types (e.g., elastic, hyper-elastic)‌​‌ and actuation methods (e.g.,​​ pneumatic, cable-driven, servomotor). These​​​‌ simulators also provide tools‌ for various tasks, such‌​‌ as calibration, design, and​​ control. However, efficiently and​​​‌ accurately computing derivatives within‌ these simulators remains a‌​‌ challenge, particularly in the​​ presence of physical contact​​​‌ interactions. Incorporating these derivatives‌ can, for instance, significantly‌​‌ improve the convergence speed​​ of control methods like​​​‌ reinforcement learning and trajectory‌ optimization, enable gradient-based techniques‌​‌ for design, or facilitate​​ end-to-end machine-learning approaches for​​​‌ model reduction. This paper‌ 29 addresses these challenges‌​‌ by introducing a unified​​ method for computing the​​​‌ derivatives of mechanical equations‌ within the finite element‌​‌ method framework, including contact​​ interactions modeled as a​​​‌ nonlinear complementarity problem. The‌ proposed approach handles both‌​‌ collision and friction phases,​​ accounts for their nonsmooth​​​‌ dynamics, and leverages the‌ sparsity introduced by mesh-based‌​‌ models. Its effectiveness is​​ demonstrated through several examples​​​‌ of controlling and calibrating‌ soft systems.

8.3.3 Constrained‌​‌ Articulated Body Algorithms for​​ Closed-Loop Mechanisms

Participants: Ajay​​​‌ Suresha Sathya, Justin‌ Carpentier.

Efficient rigid-body dynamics algorithms are instrumental in enabling high-frequency dynamics evaluation for resource-intensive applications (e.g., model predictive control, large-scale simulation, reinforcement learning), potentially on resource-constrained hardware. Existing recursive algorithms with low computational complexity are mostly restricted to kinematic trees with external contact constraints or are sensitive to singular cases (e.g., linearly dependent constraints and kinematic singularities), severely impacting their practical usage in existing simulators. This article 54 introduces two original low-complexity recursive algorithms, loop-constrained articulated body algorithm (LCABA) and proxBBO, based on a proximal dynamics formulation for forward simulation of mechanisms with loops. These algorithms are derived from first principles using non-serial dynamic programming, exhibit linear complexity in practical scenarios, and are numerically robust to singular cases. They extend the existing constrained articulated body algorithm (constrainedABA) to handle internal loops and the pioneering BBO algorithm from the 1980s to singular cases. Both algorithms have been implemented by leveraging the open-source Pinocchio library, benchmarked in detail, and exhibit state-of-the-art performance for various robot topologies, including over 6x speed-ups compared to existing non-recursive algorithms for high degree-of-freedom systems with internal loops such as recent humanoid robots.

8.3.4 A Data-driven​​​‌ Contact Estimation Method for‌ Wheeled-Biped Robots

Participants: Ü.‌​‌ Bora Gökbakan, Frederike​​​‌ Dümbgen, Stéphane Caron​.

Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. In this work 45, we design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.

Figure 17​‌: Robustly detecting the​​ moments when a wheeled-biped​​​‌ robot makes and breaks​ contact is crucial for​‌ successful estimation and control.​​ This paper proposes a​​​‌ contact estimator based only​ on inertial and torque​‌ measurements. The measurements are​​ fed into a novel​​​‌ Bayesian filter formulation to​ robustly estimate the binary​‌ contact state. We validate​​ our results extensively both​​​‌ in simulation and real-world​ experiments, as depicted in​‌ the bottom figure.
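To make the predict/update structure of a Bayes filter over a binary contact state concrete, here is a minimal sketch. In 45 the update step is learned from torque measurements and the prediction step relies on inertial measurements; in this sketch both are replaced by plain numbers supplied by the caller. The function name `bayes_contact_step`, the symmetric switching model `p_stay`, and the two likelihood arguments are assumptions made for this example.

```python
def bayes_contact_step(prior, p_stay, lik_contact, lik_no_contact):
    """One predict/update cycle of a binary Bayes filter.

    prior: current belief P(contact), in [0, 1].
    p_stay: hypothetical probability that the contact state does not change
        between two time steps (same value for both states).
    lik_contact: likelihood of the current measurement given contact.
    lik_no_contact: likelihood of the current measurement given no contact.
    Returns the posterior belief P(contact | measurement).
    """
    # Prediction: propagate the belief through the two-state transition model.
    predicted = p_stay * prior + (1.0 - p_stay) * (1.0 - prior)
    # Update: apply Bayes' rule with the measurement likelihoods.
    numerator = lik_contact * predicted
    denominator = numerator + lik_no_contact * (1.0 - predicted)
    return numerator / denominator
```

Starting from an uninformed belief of 0.5, a measurement that is four times more likely under contact than under no contact moves the belief to about 0.8.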

8.3.5​​ Guardian: Detecting Robotic Planning​​​‌ and Execution Errors with​ Vision-Language Models

Participants: Paul​‌ Pacaud, Ricardo Garcia​​, Shizhe Chen,​​​‌ Cordelia Schmid.

This paper 52 addresses the problem of reliable failure detection and recovery in robotic manipulation. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection, as illustrated in Fig. 18. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and on real robots, demonstrating the impact of our generated failure data.

Figure 18: Illustration of our Guardian model, a VLM fine-tuned on our constructed failure datasets. It detects planning failures (top) and execution failures (bottom) in robotic manipulation.

8.3.6​​ Augmented Lagrangian methods for​​​‌ infeasible convex optimization problems‌ and diverging proximal-point algorithms‌​‌

Participants: Roland Andrews,​​ Justin Carpentier, Adrien​​​‌ Taylor.

This work‌ investigates the convergence behavior‌​‌ of augmented Lagrangian methods​​ (ALMs) when applied to​​​‌ convex optimization problems that‌ may be infeasible. ALMs‌​‌ are a popular class​​ of algorithms for solving​​​‌ constrained optimization problems. We‌ establish progressively stronger convergence‌​‌ results, ranging from basic​​ sequence convergence to precise​​​‌ convergence rates, under a‌ hierarchy of assumptions. In‌​‌ particular, we demonstrate that,​​ under mild assumptions, the​​​‌ sequences of iterates generated‌ by ALMs converge to‌​‌ solutions of the “closest​​ feasible problem”.

This study​​​‌ leverages the classical relationship‌ between ALMs and the‌​‌ proximal-point algorithm applied to​​ the dual problem. A​​​‌ key technical contribution is‌ a set of concise‌​‌ results on the behavior​​ of the proximal-point algorithm​​​‌ when applied to functions‌ that may not have‌​‌ minimizers. These results pertain​​ to its convergence in​​​‌ terms of its subgradients‌ and of the values‌​‌ of the convex conjugate.​​
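The "closest feasible problem" behavior can be observed on a one-dimensional toy problem, sketched below under assumptions that are ours and not taken from the paper: minimize 0.5 (x - 2)^2 subject to the contradictory constraints x = 1 and x = -1. The least-squares constraint residual is minimized at x = 0, so the closest feasible problem has the single solution x = 0. The function name `alm_infeasible_demo` and the parameter values are hypothetical.

```python
def alm_infeasible_demo(iterations=50, rho=1.0):
    """Run an augmented Lagrangian method on an infeasible toy problem.

    Problem: minimize 0.5 * (x - 2)**2 subject to x = 1 and x = -1.
    Returns the final primal iterate and the two multipliers.
    """
    x, lam1, lam2 = 0.0, 0.0, 0.0
    for _ in range(iterations):
        # Closed-form minimizer in x of the augmented Lagrangian
        # 0.5*(x-2)^2 + lam1*(x-1) + lam2*(x+1) + rho/2*((x-1)^2 + (x+1)^2).
        x = (2.0 - (lam1 + lam2)) / (1.0 + 2.0 * rho)
        # Multiplier (dual ascent) step on each constraint residual.
        lam1 += rho * (x - 1.0)
        lam2 += rho * (x + 1.0)
    return x, lam1, lam2
```

In this toy run the primal iterate tends to 0 while the gap between the two multipliers grows by 2 * rho at every iteration, which is in line with the diverging proximal-point iterates on the dual side discussed above.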

8.3.7 Certifiably optimal rotation​​​‌ and pose estimation based‌ on the Cayley map‌​‌

Participants: Timothy D Barfoot​​, Connor Holmes,​​​‌ Frederike Dümbgen.

We‌ present novel, convex relaxations‌​‌ for rotation and pose​​ estimation problems that can​​​‌ a posteriori guarantee global‌ optimality for practical measurement‌​‌ noise levels. Some such​​ relaxations exist in the​​​‌ literature for specific problem‌ setups that assume the‌​‌ matrix von Mises-Fisher distribution​​ (a.k.a., matrix Langevin distribution​​​‌ or chordal distance) for‌ isotropic rotational uncertainty. However,‌​‌ another common way to​​ represent uncertainty for rotations​​​‌ and poses is to‌ define anisotropic noise in‌​‌ the associated Lie algebra.​​ Starting from a noise​​​‌ model based on the‌ Cayley map, we define‌​‌ our estimation problems, convert​​ them to Quadratically Constrained​​​‌ Quadratic Programs (QCQPs), then‌ relax them to Semidefinite‌​‌ Programs (SDPs), which can​​ be solved using standard​​​‌ interior-point optimization methods; global‌ optimality follows from Lagrangian‌​‌ strong duality. We first​​ show how to carry​​​‌ out basic rotation and‌ pose averaging. We then‌​‌ turn to the more​​ complex problem of trajectory​​​‌ estimation, which involves many‌ pose variables with both‌​‌ individual and inter-pose measurements​​ (or motion priors). Our​​​‌ contribution 12 is to‌ formulate SDP relaxations for‌​‌ all these problems based​​ on the Cayley map​​​‌ (including the identification of‌ redundant constraints) and to‌​‌ show them working in​​ practical settings. We hope​​​‌ our results can add‌ to the catalogue of‌​‌ useful estimation problems whose​​ solutions can be a​​​‌ posteriori guaranteed to be‌ globally optimal.
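For readers unfamiliar with the Cayley map, the sketch below builds a rotation matrix from a three-vector `a` using the closed form of $R = (I - [a]_\times)^{-1}(I + [a]_\times)$. This scaling convention is an assumption made for the example and may differ from the one used in 12 (some authors use $[a]_\times / 2$); the function name `cayley_rotation` is also hypothetical.

```python
def cayley_rotation(a):
    """Rotation matrix obtained from the vector a through the Cayley map.

    Uses the closed form
        R = ((1 - a.a) I + 2 a a^T + 2 [a]x) / (1 + a.a),
    which equals (I - [a]x)^{-1} (I + [a]x) for the skew matrix [a]x.
    """
    ax, ay, az = a
    n2 = ax * ax + ay * ay + az * az
    skew = [[0.0, -az, ay], [az, 0.0, -ax], [-ay, ax, 0.0]]
    return [
        [
            ((1.0 - n2) * (i == j) + 2.0 * a[i] * a[j] + 2.0 * skew[i][j])
            / (1.0 + n2)
            for j in range(3)
        ]
        for i in range(3)
    ]
```

Under this convention the vector (0, 0, 1) maps to a quarter turn about the z-axis, and the zero vector maps to the identity. Because the map is rational in `a`, estimation problems written with it can be expressed with polynomial (in particular quadratic) constraints, which is what makes the QCQP reformulation mentioned above possible.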

8.3.8 Human-Robot‌​‌ Co-Simulation method for upper​​ limb assistive force calculation​​​‌ using polytopes

Participants: Maël‌ Gallois, Mégane Millan‌​‌, Nicolas Vignais,​​ Sylvain Guégan, Marie​​​‌ Babel, Justin Carpentier‌, Charles Pontonnier.‌​‌

Assisting the upper limb constitutes a significant challenge in the rehabilitation and readaptation of individuals with neuromuscular and/or neurodegenerative disorders. To address this issue, robotic devices such as exoskeletons have been designed. However, the control of these devices remains intricate and challenging, particularly in the context of the upper limb. The objective of this study 13 is to define a method to compute the assistance an exoskeleton should provide to the user, according to its capabilities. To minimize the user's voluntary effort, we compute optimal assisting forces that minimize the torque the user must exert along a movement while maximizing the forces provided at the user interfaces. Polytopes (Skiric 2023) are used to determine feasible sets.

8.3.9​ PROXQP: an Efficient and​‌ Versatile Quadratic Programming Solver​​ for Real-Time Robotics Applications​​​‌ and Beyond

Participants: Antoine​ Bambade, Fabian Schramm​‌, Sarah El Kazdadi​​, Stéphane Caron,​​​‌ Adrien Taylor, Justin​ Carpentier.

Convex quadratic programming (QP) has become a core component in the modern engineering toolkit, particularly in robotics, where QP problems are legion, ranging from real-time whole-body controllers to planning and estimation algorithms. Many of those QPs need to be solved at high frequency. Meeting timing requirements requires taking advantage of as many structural properties as possible for the problem at hand. For instance, it is generally crucial to resort to warm-starting to exploit the resemblance of consecutive control iterations. While a large range of off-the-shelf QP solvers is available, only a few are suited to exploit problem structure and warm-starting capacities adequately. In this work 11, we propose the PROXQP algorithm, a new and efficient QP solver that exploits QP structures by leveraging primal-dual augmented Lagrangian techniques. For convex QPs, PROXQP features a global convergence guarantee to the closest feasible QP, an essential property for safe closed-loop control. We illustrate its practical performance on various standard robotic and control experiments, including a real-world closed-loop model predictive control application. While originally tailored for robotics applications, we show that PROXQP also performs at the level of state of the art on generic QP problems, making PROXQP suitable for use as an off-the-shelf solver for regular applications beyond robotics.

8.3.10 Optimal Control​‌ of Walkers with Parallel​​ Actuation

Participants: Ludovic de​​​‌ Matteis, Virgile Batto​, Justin Carpentier,​‌ Nicolas Mansard.

Legged robots with complex kinematic architectures, such as parallel linkages, offer significant advancements in mobility and efficiency. However, generating versatile movements for these robots requires accurate dynamic modeling that reflects their specific mechanical structures. Previous approaches often relied on simplified models, resulting in sub-optimal control, particularly in tasks requiring the full actuator range. Here 28, we present a method that fully models the dynamics of legged robots with parallel linkages, formulating their motion generation as an optimal control problem with specific contact dynamics. We introduce 6D kinematic closure constraints and derive their analytical derivatives, enabling the solver to exploit nonlinear transmission and the consequent variable actuator reduction. This approach reduces peak motor torques and expands the usable range of actuator motion and force. We empirically demonstrate that fully modeling the kinematics leads to superior performance, especially in demanding tasks such as fast walking and stair climbing. Beyond serial-parallel designs, our method also addresses motion generation for fully-parallel walkers.

8.3.11 Extended‌​‌ URDF: Accounting for parallel​​ mechanism in robot description​​​‌

Participants: Virgile Batto,‌ Ludovic de Matteis,‌​‌ Nicolas Mansard.

Robotic designs have played an important role in recent advances by providing powerful robots with complex mechanics. Many recent systems rely on parallel actuation to provide lighter limbs and allow more complex motion. However, these emerging architectures fall outside the scope of the most widely used description formats, leading to difficulties when designing, storing, and sharing the models of these systems. This paper 18 introduces an extension to the widely used Unified Robot Description Format (URDF) to support closed-loop kinematic structures. Our approach relies on augmenting URDF with minimal additional information to allow more efficient modeling of complex robotic systems while maintaining compatibility with existing design and simulation frameworks. This method sets the basic requirement for a description format to handle parallel mechanisms efficiently. We demonstrate the applicability of our approach by providing an open-source collection of parallel robots, along with tools for generating and parsing this extended description format. The proposed extension simplifies robot modeling, reduces redundancy, and improves usability for advanced robotic applications.

8.3.12 PROXDDP: Proximal​​​‌ Constrained Trajectory Optimization

Participants:‌ Wilson Jallet, Antoine‌​‌ Bambade, Etienne Arlaud​​, Sarah El Kazdadi​​​‌, Nicolas Mansard,‌ Justin Carpentier.

Trajectory optimization has been a popular choice for motion generation and control in robotics for at least a decade. Several numerical approaches have exhibited the required speed to enable online computation of trajectories for real-time control of various systems, including complex robots. Many of these are based on the differential dynamic programming (DDP) algorithm (initially designed for unconstrained trajectory optimization problems) and its variants, which are relatively easy to implement and provide good runtime performance. However, several problems in robot control call for using constrained formulations (e.g., torque limits, obstacle avoidance), from which several difficulties arise when trying to adapt DDP-type methods: numerical stability, computational efficiency, and constraint satisfaction. In this article 14, we leverage proximal methods for constrained optimization and introduce a DDP-type method for fast, constrained trajectory optimization suited for model-predictive control (MPC) applications with easy warm-starting. Compared to earlier solvers, our approach effectively manages hard constraints without warm-start limitations and exhibits good convergence behavior. We provide a complete implementation as part of an open-source and flexible C++ trajectory optimization library called ALIGATOR. These algorithmic contributions are validated through several trajectory planning scenarios from the robotics literature and the real-time whole-body MPC of a quadruped robot.

8.3.13 Structure-Exploiting​​​‌ Sequential Quadratic Programming for‌ Model-Predictive Control

Participants: Armand‌​‌ Jordana, Sébastien Kleff​​, Avadesh Meduri,​​​‌ Justin Carpentier, Nicolas‌ Mansard, Ludovic Righetti‌​‌.

The promise of model-predictive control in robotics has led to extensive development of efficient numerical optimal control solvers in line with differential dynamic programming because it exploits the sparsity induced by time. In this work 15, we argue that this effervescence has hidden the fact that sparsity can be equally exploited by standard nonlinear optimization. In particular, we show how a tailored implementation of sequential quadratic programming achieves state-of-the-art model-predictive control. Then, we clarify the connections between popular algorithms from the robotics community and well-established optimization techniques. Further, the sequential quadratic program formulation naturally encompasses the constrained case, a notoriously difficult problem in the robotics community. Specifically, we show that it only requires a sparsity-exploiting implementation of a state-of-the-art quadratic programming solver. We illustrate the validity of this approach in a comparative study and experiments on a torque-controlled manipulator. To the best of our knowledge, this is the first demonstration of closed-loop nonlinear model-predictive control with constraints on a real robot.

8.3.14 Modeling, Embedded Control​ and Design of Soft​‌ Robots using a Learned​​ Condensed FEM Model

Participants:​​​‌ Tanguy Navez, Etienne​ Ménager, Paul Chaillou​‌, Olivier Goury,​​ Alexandre Kruszewski, Christian​​​‌ Duriez.

The Finite Element Method (FEM) is a powerful modeling tool for predicting soft robots' behavior, but its computation time can limit practical applications. In this paper 16, a learning-based approach based on condensation of the FEM model is detailed. The proposed method handles several kinds of actuators and contacts with the environment. We demonstrate that this compact model can be learned as a unified model across several designs and remains very efficient in terms of modeling since we can deduce the direct and inverse kinematics of the robot. Building upon the intuition introduced in [11], the learned model is presented as a general framework for modeling, controlling, and designing soft manipulators. First, the method's adaptability and versatility are illustrated through optimization-based control problems involving positioning and manipulation tasks with mechanical contact-based coupling. Second, the low memory consumption and the high prediction speed of the learned condensed model are leveraged for real-time embedded control without relying on costly online FEM simulation. Finally, the ability of the learned condensed FEM model to capture soft robot design variations and its differentiability are leveraged in calibration and design optimization applications.

8.3.15 Infinite-Horizon Value Function​​​‌ Approximation for Model Predictive​ Control

Participants: Armand Jordana​‌, Sébastien Kleff,​​ Arthur Haffemayer, Joaquim​​​‌ Ortiz-Haro, Justin Carpentier​, Nicolas Mansard,​‌ Ludovic Righetti.

Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work 17, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator where the value function is conditioned on the goal and obstacle.

8.4‌​‌ Image restoration and enhancement​​

8.4.1 A New Statistical​​​‌ Model of Star Speckles‌ for Learning to Detect‌​‌ and Characterize Exoplanets in​​ Direct Imaging Observations

Participants:​​​‌ Théo Bodrito, Olivier‌ Flasseur, Julien Mairal‌​‌, Jean Ponce,​​ Maud Langlois, Anne-Marie​​​‌ Lagrange.

The search‌ for exoplanets is an‌​‌ active field in astronomy,​​ with direct imaging as​​​‌ one of the most‌ challenging methods due to‌​‌ faint exoplanet signals buried​​ within stronger residual starlight.​​​‌ Successful detection requires advanced‌ image processing to separate‌​‌ the exoplanet signal from​​ this nuisance component. This​​​‌ paper 19 presents a‌ novel statistical model that‌​‌ captures nuisance fluctuations using​​ a multi-scale approach, leveraging​​​‌ problem symmetries and a‌ joint spectral channel representation‌​‌ grounded in physical principles.​​ Our model integrates into​​​‌ an interpretable, end-to-end learnable‌ framework for simultaneous exoplanet‌​‌ detection and flux estimation.​​ The proposed algorithm is​​​‌ evaluated against the state‌ of the art using‌​‌ datasets from the SPHERE​​ instrument operating at the​​​‌ Very Large Telescope (VLT).‌ It significantly improves the‌​‌ precision-recall trade-off, notably on​​ challenging datasets that are​​​‌ otherwise unusable by astronomers.‌ The proposed approach is‌​‌ computationally efficient, robust to​​ varying data quality, and​​​‌ well suited for large-scale‌ observational surveys.

8.4.2 Deep‌​‌ learning for exoplanet detection​​ and characterization by direct​​​‌ imaging at high contrast‌

Participants: Théo Bodrito,‌​‌ Olivier Flasseur, Julien​​ Mairal, Jean Ponce​​​‌, Maud Langlois,‌ Anne-Marie Lagrange.

Exoplanet‌​‌ imaging is a major​​ challenge in astrophysics due​​​‌ to the need for‌ high angular resolution and‌​‌ high contrast. We present​​ a multi-scale statistical model​​​‌ 20 for the nuisance‌ component corrupting multivariate image‌​‌ series at high contrast.​​ Integrated into a learnable​​​‌ architecture, it leverages the‌ physics of the problem‌​‌ and enables the fusion​​ of multiple observations of​​​‌ the same star in‌ a way that is‌​‌ optimal in terms of​​ detection signal-to-noise ratio. Applied​​​‌ to data from the‌ VLT/SPHERE instrument, the method‌​‌ significantly improves the detection​​ sensitivity and the accuracy​​​‌ of astrometric and photometric‌ estimation.

8.4.3 Joint statistical‌​‌ modeling and deep learning​​ for exoplanet detection and​​​‌ characterization by direct imaging‌ at high contrast

Participants:‌​‌ Théo Bodrito, Olivier​​ Flasseur, Julien Mairal​​​‌, Jean Ponce,‌ Maud Langlois, Anne-Marie‌​‌ Lagrange.

The detection​​ of exoplanets, the characterization​​​‌ of their atmospheres, and‌ the study of exoplanet‌​‌ formation mechanisms are major​​ current challenges in astrophysics.​​​‌ High-contrast direct imaging (HCI)‌ is one of the‌​‌ observational techniques of choice​​ to address these questions.​​​‌ However, such observations are‌ particularly demanding due to‌​‌ the extreme contrast levels​​ and angular resolution required.​​​‌ In addition to the‌ use of extreme adaptive‌​‌ optics and coronagraphs, advances​​​‌ in data science have​ become critical for analyzing​‌ these observations and disentangling​​ the signals of interest​​​‌ (exoplanets and circumstellar disks)​ from the strong nuisance​‌ component (speckles and noise)​​ that corrupts the data.​​​‌ In this context, we​ will present our recent​‌ developments in deep learning​​ applied to HCI 21​​​‌, aimed at the​ optimal and reliable extraction​‌ of astrophysical information from​​ multivariate observations (including spatial,​​​‌ temporal, spectral, and multi-epoch​ diversity). These approaches are​‌ based on a fine​​ modeling of the different​​​‌ components contributing to the​ total signal and incorporate​‌ physical domain knowledge as​​ prior information. Emphasis will​​​‌ be placed on (i)​ combining deep learning models​‌ with statistical modeling of​​ the nuisance, (ii) leveraging​​​‌ large archival datasets as​ a valuable source of​‌ diversity for tackling the​​ unmixing task, and (iii)​​​‌ jointly exploiting the spectral​ diversity of observations. Our​‌ methods are tailored to​​ the specific challenges of​​​‌ high-contrast imaging: (i) very​ low signal-to-noise ratios and​‌ non-stationary noise, (ii) detection​​ of rare events, and​​​‌ (iii) absence of ground​ truth. 
Using data from​‌ the VLT/SPHERE instrument, we​​ will show that these​​​‌ approaches enable fine modeling​ and effective subtraction of​‌ the nuisance component, leading​​ to reliable and nearly​​​‌ optimal estimates of the​ astrophysical quantities of interest.​‌ This results in significantly​​ improved detection sensitivity and​​​‌ more accurate astro-photometric characterization.​ The proposed approaches are​‌ also scalable and readily​​ applicable to large-scale surveys.​​​‌ Looking ahead, instruments on​ the next generation of​‌ thirty-meter-class telescopes will enable​​ the exploration of the​​​‌ innermost environments of Sun-like​ stars at unprecedented contrast​‌ levels. Achieving the associated​​ scientific goals will require​​​‌ addressing several data science​ challenges: (i) approaching the​‌ ultimate performance limits of​​ the instruments through optimal​​​‌ signal extraction, (ii) capturing​ complex, spatially structured nuisance​‌ exhibiting strong variability, and​​ (iii) building robust nuisance​​​‌ models that go beyond​ the limitations of angular​‌ differential imaging, particularly in​​ the vicinity of the​​​‌ host star. We will​ discuss these challenges in​‌ light of the methodological​​ developments presented.

8.4.4 Modèle statistique apprenable de mélange de distributions et fusion de données multivariées pour l'imagerie d'exoplanètes (Learnable statistical mixture model and multivariate data fusion for exoplanet imaging)

Participants: Théo​​​‌ Bodrito, Olivier Flasseur​, Julien Mairal,​‌ Jean Ponce, Maud​​ Langlois, Anne-Marie Lagrange​​​‌.

Exoplanet imaging is​ a major challenge in​‌ astrophysics due to the​​ high star-planet contrast. This​​​‌ paper 22 presents a​ multi-scale statistical model for​‌ the nuisance component corrupting​​ multivariate image series. Integrated​​​‌ into a learnable architecture,​ it leverages the physics​‌ of the problem and​​ enables the fusion of​​​‌ multiple observations of the​ same star. Applied to​‌ real data, the method​​ significantly improves the detection​​​‌ sensitivity and the accuracy​ of exoplanet position and​‌ flux estimation.

8.4.5 CoDEx:​​ Combining Domain Expertise for​​​‌ Spatial Generalization in Satellite​ Image Analysis

Participants: Abhishek​‌ Kuriyal, Elliot Vincent​​, Mathieu Aubry,​​​‌ Loic Landrieu.

Global​ variations in terrain appearance​‌ raise a major challenge​​ for satellite image analysis,​​​‌ leading to poor model​ performance when training on​‌ locations that differ from​​ those encountered at test​​ time. This remains true​​​‌ even with recent large‌ global datasets. To address‌​‌ this challenge, we propose​​ a novel domain-generalization framework​​​‌ for satellite images 26‌. Instead of trying‌​‌ to learn a single​​ generalizable model, we train​​​‌ one expert model per‌ training domain, while learning‌​‌ experts' similarity and encouraging​​ similar experts to be​​​‌ consistent. A model selection‌ module then identifies the‌​‌ most suitable experts for​​ a given test sample​​​‌ and aggregates their predictions.‌ Experiments on four datasets‌​‌ (DynamicEarthNet, MUDS, OSCD, and​​ FMoW) demonstrate consistent gains​​​‌ over existing domain generalization‌ and adaptation methods.
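
The selection and aggregation step can be pictured with the following minimal sketch. It assumes, purely for illustration, that each expert outputs a class-probability vector and that a relevance score per expert is available for the test sample; the top-k rule, the softmax weighting, and all names are hypothetical and are not taken from 26.

```python
import math


def select_and_aggregate(expert_probs, relevance, k=2):
    """Fuse the predictions of the k most relevant experts.

    expert_probs: one class-probability vector per expert (hypothetical outputs).
    relevance: one score per expert for the current test sample (hypothetical).
    """
    # Keep the k experts with the highest relevance scores.
    top = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)[:k]
    # Softmax over the selected scores gives the aggregation weights.
    m = max(relevance[i] for i in top)
    exps = [math.exp(relevance[i] - m) for i in top]
    weights = [e / sum(exps) for e in exps]
    n_classes = len(expert_probs[0])
    return [
        sum(w * expert_probs[i][c] for w, i in zip(weights, top))
        for c in range(n_classes)
    ]


expert_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # three experts, two classes
relevance = [2.0, -1.0, 2.0]                          # expert 1 is a poor match
fused = select_and_aggregate(expert_probs, relevance, k=2)
print(fused)
```

Here the poorly matched expert is ignored and the two remaining experts contribute equally, since their relevance scores are identical.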

8.5‌​‌ Doctoral dissertations and habilitation​​ theses

8.5.1 Deep learning​​​‌ for exoplanet detection in‌ high contrast imaging

Participants:‌​‌ Théo Bodrito.

The thesis 37 addresses the challenge of detecting and characterizing exoplanets through direct imaging, a technique hindered by the extreme contrast and small angular separation between stars and planets. To overcome these issues, this work introduces three hybrid approaches that combine statistical modeling with deep learning, leveraging large datasets from high-contrast imaging surveys. The deep PACO method integrates a local statistical model of the nuisance component with a convolutional neural network, improving detection and characterization performance over classical algorithms. MODEL&CO further advances this by learning a single model across a large multi-observation dataset, enabling robust detection even in challenging conditions. ExoMILD introduces a multi-scale, multi-spectral statistical framework that exploits spatial symmetries, achieving superior sensitivity and unbiased parameter estimation. Extensive testing on semi-synthetic and real datasets demonstrates significant gains in contrast and robustness, particularly at small separations. The approaches are designed to generalize across diverse observing conditions and are well suited for future large-scale surveys. Overall, the thesis establishes a new generation of deep learning-based tools for exoplanet imaging, enabling more sensitive and reliable exploration of planetary systems.

8.5.2​​ Learning dexterous manipulation from​​​‌ 3D hand and object‌ interaction

Participants: Zerui Chen‌​‌.

In this thesis 39, we advance the understanding of 3D hand motions and hand-object interactions in monocular videos. We show how these insights can empower robots with human-like dexterous manipulation capabilities. Our approach achieves dense 3D reconstructions of both hands and objects, capturing their fine-grained interactions while maintaining fast inference speed. We investigate how to leverage these 3D reconstruction results to transfer human manipulation skills to multi-fingered robotic hands through trajectory-guided reinforcement learning and vision-based imitation learning. By effectively connecting visual motion capture with robotic execution, our work creates new opportunities for human-robot collaboration. Our contributions are structured into four key areas. First, we propose a joint learning framework for 3D reconstruction of hands and objects using signed distance functions (SDFs). This method generates high-resolution meshes and captures detailed hand-object interactions. Second, to improve the alignment between the reconstructed 3D shape and its underlying poses, we leverage hand kinematic structures to guide SDF-based reconstruction, which helps enhance visual features and increase robustness to occlusions. Third, while SDF-based methods yield promising results, they are computationally intensive and often produce overly smooth surfaces. To address this, we introduce a novel transformer-based approach for reconstructing dense point clouds of hand-held objects, achieving high-quality 3D reconstructions with fast inference speed. Finally, although vision systems produce visually plausible 3D hand and object configurations, these configurations may not always be physically plausible, which makes them less useful for robot learning.
To tackle this,​​ we develop ViViDex and​​​‌ first employ reinforcement learning​ to refine these noisy​‌ configurations. Then, we apply​​ imitation learning to train​​​‌ a unified vision-based policy​ from refined trajectories. As​‌ a result, ViViDex generates​​ natural manipulation sequences and​​​‌ demonstrates superior performance across​ three dexterous manipulation tasks.​‌

8.5.3 Analysis of satellite​​ image time series for​​​‌ classification and change detection​

Participants: Elliot Vincent.​‌

This thesis 41 develops​​ machine learning methods for​​​‌ analyzing time series of​ satellite images (STIS), focusing​‌ on soil classification and​​ semantic change detection. We​​​‌ propose three main areas​ of improvement. First, we​‌ design architectures specifically tailored​​ for STIS: DTI-TS for​​​‌ agricultural classification, multiUTAE for​ change detection, and a​‌ combination of a foundation​​ model with a temporal​​​‌ attention model for detecting​ archaeological looting. Second, we​‌ address the lack of​​ annotated data by developing​​​‌ weakly supervised methods and​ introducing a dataset for​‌ archaeological looting detection. Finally,​​ we examine the impact​​​‌ of spatial and temporal​ domain shifts on model​‌ performance. The DTI-TS method​​ aligns time series prototypes​​​‌ with data using spectral​ and temporal transformations. It​‌ excels in contexts with​​ temporal shifts and data​​​‌ scarcity while maintaining good​ interpretability. MultiUTAE segments all​‌ images in a series​​ simultaneously, leveraging information over​​​‌ a broad temporal window.​ This sequence-to-sequence approach outperforms​‌ methods that process images​​ individually or in pairs​​​‌ across various domain shift​ scenarios. For archaeological looting​‌ detection, the thesis introduces​​ DAFA-LS, a dataset of​​​‌ Afghan sites. The best​ performance is achieved by​‌ a method combining a​​ pre-trained foundation model and​​​‌ an attention model. Future​ research directions include leveraging​‌ foundation models and multi-modality,​​ enhancing time series by​​​‌ improving resolution or adding​ elevation data, and developing​‌ unsupervised learning and domain​​ adaptation to mitigate the​​​‌ lack of annotated data.​

8.5.4 Object-centric representations for​‌ sensing and planning in​​ visually-guided robotics

Participants: Thomas​​​‌ Chabal.

This thesis 38 pursues the ultimate goal of developing autonomous and intelligent robotic assistants able to perceive and understand the world, and to explore and act on it. Such systems would have to navigate to and interact with a variety of individual components, or objects. Working from images, this thesis specifically aims at developing novel object-centric representations and algorithms to perceive and reconstruct scenes, before planning robotic manipulation and navigation actions. It addresses three main challenges: handling failures of visual systems and partial knowledge of goals in robotic assembly tasks, efficiently acquiring accurate and complete object-level representations of scenes in an online fashion, and learning to understand the semantics of human-arranged environments to explore and search for objects with a mobile robot. We advance the field through the following three contributions. First, we study the problem of stacking objects with a robotic manipulator to reproduce an assembly specified through a single photograph. As visual systems encounter unavoidable failures in analyzing images, notably due to occlusions, the target structure is only partially known. We present an approach intertwining an abstract search for a high-level assembly plan and a physical grounding of candidate plans in the real world. Our method, deployed on a robotic manipulator, builds stable structures that match the goal assemblies, which are obtained by extracting object poses from the goal image with an off-the-shelf procedure. Second, beyond fixed robots with limited access to observations of their surroundings, we consider cameras that freely move in the physical world and explore online scene reconstruction at the level of objects from a stream of posed RGB-D frames.
We model‌​‌ objects as neural implicit​​ representations, entailing feature grids​​​‌ and small perceptrons, optimized‌ per scene with differentiable‌​‌ rendering. We propose a​​ feature grid interpolation scheme​​​‌ to adapt to novel‌ views of yet unseen‌​‌ object parts, as well​​ as a relocalization approach​​​‌ to reuse object models‌ in novel scenes and‌​‌ an update procedure synthesizing​​ views from past viewpoints​​​‌ to increase the completion‌ of reconstructed objects in‌​‌ novel sequences. Finally, we​​ focus on robot navigation​​​‌ towards objects specified as‌ categories in unknown environments,‌​‌ a task requiring accurate​​ scene understanding and efficient​​​‌ exploration. We introduce an‌ online frontier-object mapping with‌​‌ rich visual and semantic​​ representations of frontiers, or​​​‌ boundaries of the explored‌ area, and object instances.‌​‌ Our navigation strategy combines​​ a high-level goal prediction​​​‌ stage relying on a‌ vision-language model endowed with‌​‌ learnt navigation-specific encoders and​​ decoders and a low-level​​​‌ path planner that generates‌ trajectories. Our modular framework,‌​‌ dubbed FOM-Nav for Frontier-Object​​ Maps, is trained on​​​‌ an automatically self-collected large-scale‌ navigation dataset in scanned‌​‌ environments and significantly improves​​ exploration efficiency over prior​​​‌ works.
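
To make the notion of a frontier concrete, the sketch below uses the common occupancy-grid definition, in which a frontier cell is a free cell adjacent to unexplored space. This is a generic illustration under that assumption; the grid encoding and function name are hypothetical, and FOM-Nav's frontier-object maps carry much richer visual and semantic information.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # hypothetical cell encoding


def frontier_cells(grid):
    """Return the (row, col) of every free cell with a 4-connected unknown neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
print(frontier_cells(grid))
```

An exploration strategy would then pick one of the returned cells as the next navigation goal; how that choice is made is precisely where learned goal prediction comes in.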

8.5.5 Learning Visuomotor‌ Policies for Robotic Manipulation‌​‌

Participants: Ricardo Garcia.​​

This thesis 40 focuses on the development of representations and learning algorithms to perform visually-guided robotic manipulation tasks in unstructured environments. One of the final goals of robotics is to train robots that can autonomously solve a wide range of tasks in the real world based on human instructions. To make progress towards this goal, this manuscript covers three main challenges: closing the sim-to-real gap for visuomotor policies, integrating 3D point cloud representations with language instructions to improve the performance of robotic manipulation in multi-task settings, and developing a generalist language-guided visuomotor policy for robotic manipulation. We first address the challenge of sim-to-real transfer in robotic manipulation. Training policies in simulation is less time-consuming and safer than in the real world. However, the discrepancies between simulation and the real environment can limit the transferability of policies trained in simulation to the real world. This issue is known as the sim-to-real gap, and domain randomization (DR) is a known technique to address this gap. DR enables sim-to-real policy transfer by randomizing the simulation's appearance (textures, lighting, object colors, and camera viewpoints) during training. However, selecting the right ranges for the randomized parameters is not trivial. This thesis proposes a data-driven strategy to systematically select the DR parameters using multi-object localization as a proxy task. We then focus on language-guided robotic manipulation and propose PolarNet and 3D-LOTUS, two 3D point cloud-based methods to integrate visual inputs and language instructions to predict manipulation actions. 
Both methods use efficient point cloud encoders and multimodal transformers to combine the text instructions and the geometric information from point clouds, enabling more accurate and efficient manipulation than 2D image-based approaches. Both 3D-based policies outperform state-of-the-art models across various multi-task settings of the RLBench benchmark and successfully transfer to the real-world robot, highlighting their performance in diverse environments. The last part of this thesis focuses on developing generalist robot policies for robotic manipulation. First, we propose GemBench, a comprehensive benchmark for evaluating the generalization capabilities of such policies on a set of tasks with four levels of increasing difficulty: (1) novel object placements, (2) novel rigid objects, (3) novel articulated objects, and (4) long-horizon tasks that require sequential planning. We then propose 3D-LOTUS++, which extends our point cloud-based policy 3D-LOTUS by incorporating large language models (LLMs) for task planning and vision-language models (VLMs) for object grounding. This modular framework achieves state-of-the-art performance on this new benchmark. Through these contributions, this thesis advances the development of robust, precise, and generalist visuomotor policies for robotic manipulation.
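
Domain randomization as described above amounts to drawing new appearance parameters for each training episode. The following minimal sketch shows the sampling step only; the parameter names, ranges, and texture list are hypothetical placeholders, whereas the thesis selects such ranges with a data-driven procedure.

```python
import random

# Hypothetical randomization ranges (lower bound, upper bound).
DR_RANGES = {
    "light_intensity": (0.5, 1.5),
    "camera_yaw_deg": (-10.0, 10.0),
    "object_hue": (0.0, 1.0),
}
TEXTURES = ["wood", "metal", "checker"]  # hypothetical texture bank


def sample_scene_params(rng):
    """Draw one set of appearance parameters for a simulated training episode."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in DR_RANGES.items()}
    params["table_texture"] = rng.choice(TEXTURES)
    return params


rng = random.Random(0)  # seeded for reproducibility
episodes = [sample_scene_params(rng) for _ in range(3)]
for params in episodes:
    print(params)
```

Ranges that are too narrow leave the policy brittle on the real robot, while ranges that are too wide make training harder, which is why their selection matters.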

9 Bilateral​ contracts and grants with​‌ industry

9.1 Bilateral contracts​​ with industry

9.1.1 Louis​​​‌ Vuitton/ENS chair on artificial​ intelligence

Participants: Jean Ponce​‌.

The scientific chair Louis Vuitton - École normale supérieure in Artificial Intelligence was created in 2017 and inaugurated on April 12, 2018 by the ENS Director Marc Mézard and the LV CEO Michael Burke. The goal of the chair is to establish a close collaboration between LV and ENS in the area of Artificial Intelligence. The chair receives a generous annual contribution of 200K Euros from LV in support of research activities in statistical learning and computer vision. In particular, the chair supports the costs of researchers, students, missions, and computational resources, as well as seminars and meetings, including the two-day meeting organized annually by LV and ENS. During 2020, ENS and LV organized several joint meetings with the participation of researchers from the SIERRA and WILLOW teams. The chair has also supported the hiring of one PhD student in the WILLOW team, missions to conferences and international research labs, as well as data collection for research projects. In 2020 the chair was extended for a further three-year period, until 2023. We are planning to start a CIFRE PhD of François Gardères together with Louis Vuitton in 2023.

9.1.2 Casino/ENS chair​​ on algorithmic and machine​​ learning

Participants: Justin Carpentier​​​‌.

The scientific chair Casino/ENS - École normale supérieure on algorithmic and machine learning was created in 2021. J. Carpentier is in charge of the robotics axis of this chair.

10​​​‌ Partnerships and cooperations

10.1‌ International research visitors

10.1.1‌​‌ Visits of international scientists​​

Other international visits to​​​‌ the team
Marc Toussaint‌
  • Status:
    Professor
  • Institution of‌​‌ origin:
    TU Berlin
  • Country:​​
    Germany
  • Dates:
    1 month​​​‌
  • Context of the visit:‌
    collaboration
  • Mobility program/type of‌​‌ mobility:
    research stay
Mike​​ Tarr
  • Status:
    Professor
  • Institution​​​‌ of origin:
    Carnegie Mellon‌ University
  • Country:
    US
  • Dates:‌​‌
    1 month
  • Context of​​ the visit:
    collaboration
  • Mobility​​​‌ program/type of mobility:
    sabbatical‌
Baohe Zhang
  • Status:
    PhD‌​‌ student
  • Institution of origin:​​
    University of Freiburg
  • Country:​​​‌
    Germany
  • Dates:
    5 months‌
  • Context of the visit:‌​‌
    Collaboration on world modeling​​ for robotics
  • Mobility program/type​​​‌ of mobility:
    research stay‌

10.2 European initiatives

10.2.1‌​‌ Horizon Europe

AGIMUS

AGIMUS​​ project on cordis.europa.eu

  • Title:​​​‌
    Next generation of AI-powered‌ robotics for agile production‌​‌
  • Duration:
    From October 1,​​ 2022 to September 30,​​​‌ 2026
  • Partners:
    • INSTITUT NATIONAL‌ DE RECHERCHE EN INFORMATIQUE‌​‌ ET AUTOMATIQUE (INRIA), France​​
    • AIRBUS, France
    • KLEEMANN HELLAS​​​‌ SA (KLEEMANN HELLAS SA),‌ Greece
    • PAL ROBOTICS SLU‌​‌ (PAL ROBOTICS), Spain
    • Q-PLAN​​ INTERNATIONAL ADVISORS PC (Q-PLAN​​​‌ INTERNATIONAL), Greece
    • PAL FRANCE,‌ France
    • THIMM OBALY, K.S.,‌​‌ Czechia
    • CESKE VYSOKE UCENI​​ TECHNICKE V PRAZE (CVUT),​​​‌ Czechia
    • CENTRE NATIONAL DE‌ LA RECHERCHE SCIENTIFIQUE CNRS‌​‌ (CNRS), France
  • Inria contact:​​
    Justin Carpentier
  • Coordinator:
  • Summary:​​​‌
    AGIMUS aims to deliver an open-source breakthrough innovation in AI-powered agile production, introducing solutions that push the limits of perception, planning, and control in robotics, enabling general-purpose robots to be quick to set up, autonomous, and able to adapt easily to changes in the manufacturing process. To achieve such agile production, AGIMUS leverages cutting-edge technologies and goes beyond the state of the art to equip current mobile manipulators with a combination of (i) an advanced task and motion planner that can learn from video demonstrations available online; (ii) optimal control policies obtained from advances in reinforcement learning based on efficient differentiable physics simulations of the manufacturing process; as well as (iii) advanced perception algorithms able to handle objects and situations unseen during initial training. Along the way, optimization of energy efficiency and the use of 5G technology will support further pushing the limits of autonomy. The AGIMUS solutions and their impact will be demonstrated and thoroughly stress-tested in 3 testing zones, as well as 3 industrial pilots in Europe, under numerous diverse real-world case studies and scenarios (different tools, environments, processes, etc.). In every step, and from the very beginning, AGIMUS will go beyond current norms and involve a wide range of stakeholders, starting from the production line itself, to identify the essential ethical-by-design principles and guidelines that can maximise acceptance and impact.
ARTIFACT​​

ARTIFACT project on cordis.europa.eu​​​‌

  • Title:
    The Artificial Motion‌ Factory
  • Duration:
    From September‌​‌ 1, 2025 to August​​ 31, 2030
  • Partners:
    • INSTITUT​​​‌ NATIONAL DE RECHERCHE EN‌ INFORMATIQUE ET AUTOMATIQUE (INRIA),‌​‌ France
  • Inria contact:
    Justin​​​‌ Carpentier
  • Coordinator:
  • Summary:

    Today's robots are confined to tightly controlled environments: even the complex choreographies that the Atlas humanoid flawlessly executes heavily rely on handcrafted control strategies and detailed workspace models, with little place for sensing. To put it bluntly, robots are nowhere near the level of agility, dexterity, and even less so autonomy, robustness, and safety required for their deployment in the wild alongside people.

    The tenet​ of ARTIFACT is that​‌ the key to an​​ actual revolution will come​​​‌ from the algorithmic foundations​ of artificial motion intelligence,​‌ an AI challenged from​​ the start to interact​​​‌ physically with dynamic environments​ and, ultimately, people. To​‌ do so, we will​​ break away from the​​​‌ dichotomy between optimal control,​ where the role of​‌ perception is traditionally limited​​ to an early state​​​‌ estimation stage, and reinforcement​ learning, where control policies​‌ are typically learned model-free​​ with no guarantee to​​​‌ cope with the curse​ of dimensionality.

    In ARTIFACT,​‌ we will devise a​​ unified, structured, modular, and​​​‌ learnable control architecture for​ providing robots with advanced​‌ decision-making capabilities to solve​​ complex tasks and face​​​‌ new interactions as they​ experience the world. It​‌ will leverage the notion​​ of differentiable programming at​​​‌ all scales to enable​ robots to (i) capture​‌ models of their interactions​​ directly from a sound​​​‌ combination of sensor data​ and first principles from​‌ physics, (ii) autonomously discover​​ new complex gestures and​​​‌ movements leveraging their past​ experiences, and (iii) learn​‌ embodied representations to control​​ their interactions finely and​​​‌ reason about the physical​ world. It will be​‌ implemented in open-source software​​ and shown in real-world​​​‌ and challenging scenarios requiring​ fine dexterity and high​‌ agility. Altogether, these contributions​​ will be the key​​​‌ enablers to enhance robot​ autonomy fundamentally, thus opening​‌ the age of ubiquitous​​ robots at the service​​​‌ of mankind.

ExTRAORDiNary

ExTRAORDiNary​ project on cordis.europa.eu

  • Title:​‌
    Accelerating Differentiable Robot Dynamics​​ Simulation for Advanced Control​​​‌
  • Duration:
    From May 1,​ 2025 to April 30,​‌ 2027
  • Partners:
    • INSTITUT NATIONAL​​ DE RECHERCHE EN INFORMATIQUE​​​‌ ET AUTOMATIQUE (INRIA), France​
  • Inria contact:
    Justin Carpentier​‌
  • Coordinator:
  • Summary:

    Differentiable robot dynamics simulation is a crucial enabler of advanced robot control. It is at the heart of both model predictive control (MPC) and learning-based approaches (e.g. reinforcement learning [RL]), which are among the most successful and actively researched robot control algorithms. Increased usage of the computationally demanding MPC/RL controllers has led to a growing need for efficient dynamics simulators. However, existing simulators internally use inefficient high-complexity (worst-case cubic) constrained dynamics algorithms (CDA) and are often inefficiently implemented, leading to a several-fold slowdown compared to a fast simulator like Pinocchio.

    Addressing these​‌ concerns, we will accelerate​​ the differentiable simulation through​​​‌ three complementary strategies. We​ will 1) leverage low-complexity​‌ CDAs, 2) use Pinocchio's​​ proven efficient software design​​​‌ patterns and explore further​ acceleration via code generation​‌ computations, and 3) derive​​ efficient algorithms for differentiating​​​‌ through contact simulation.

    Furthermore,​ our simulator will solve​‌ the nonlinear complementarity problem​​ of frictional contact without​​ making physics-compromising relaxations like​​​‌ existing simulators and will‌ be publicly available as‌​‌ part of the widely​​ used open-source Pinocchio library.​​​‌ By adding key enhancements‌ to Pinocchio, we will‌​‌ make it a viable​​ alternative to the inefficient,​​​‌ but feature-rich software simulators.‌ The visibility, impact and‌​‌ usability of our simulator​​ will be enhanced by​​​‌ addressing some low-hanging fruits‌ in MPC, RL and‌​‌ physics identification applications.

    This project's contributions will not only pave the way towards fast whole-body controllers and faster and more sustainable RL training (important at a time of surging RL research activity), but will also impact adjacent fields like biomechanics and computer graphics in the long term.

LiftMeUp

LiftMeUp‌​‌ project on cordis.europa.eu

  • Title:​​
    LiftMeUp: Globally optimal algorithms​​​‌ for dexterous manipulation and‌ locomotion
  • Duration:
    From May‌​‌ 1, 2025 to April​​ 30, 2027
  • Partners:
    • INSTITUT​​​‌ NATIONAL DE RECHERCHE EN‌ INFORMATIQUE ET AUTOMATIQUE (INRIA),‌​‌ France
  • Inria contact:
    Justin​​ Carpentier
  • Coordinator:
  • Summary:

    Robots​​​‌ bear the potential to‌ help solve the world’s‌​‌ pressing problems by enabling​​ and scaling up operations​​​‌ beyond human capacities. To‌ successfully manipulate objects and‌​‌ perform reliable locomotion, robots​​ require adequate models and​​​‌ solvers. Traditionally, physics-based models‌ and iterative solvers are‌​‌ used, and obtaining reliable​​ solutions requires significant effort​​​‌ in model tuning and‌ heuristics for good convergence.‌​‌ LiftMeUp’s objective is to​​ combine data-driven modeling with​​​‌ globally optimal solvers in‌ a unique way to‌​‌ create an easy-to-use framework​​ for the life-long operation​​​‌ of robots in challenging‌ tasks. The result is‌​‌ a transparent, sample-efficient alternative​​ to the less interpretable​​​‌ and resource-hungry deep-learning solutions‌ for robotics. Furthermore, LiftMeUp‌​‌ builds on providing certifiably​​ optimal methods with important​​​‌ consequences for safety and‌ efficiency, as opposed to‌​‌ deep learning and local​​ solvers, where different initializations​​​‌ can lead to entirely‌ different solutions.

    LiftMeUp is‌​‌ carried out at WILLOW,​​ Inria Paris, known for​​​‌ cutting-edge control and locomotion‌ research, and has three‌​‌ stages: first, combining concepts​​ from Koopman theory, polynomial​​​‌ optimization, and kernel methods,‌ lifting functions are inferred‌​‌ from data and integrated​​ into globally optimal methods​​​‌ for state estimation and‌ control. Second, different models‌​‌ are optimally combined, leading​​ to a modular framework​​​‌ that can be incrementally‌ updated online. Lastly, these‌​‌ novel algorithms are implemented​​ on hardware to solve​​​‌ real-world locomotion and dexterous‌ manipulation tasks.

    This framework will have an important scientific impact by creating novel connections between global optimization and machine learning, enabling the use of principled rather than heuristic solvers in a broad range of applications in robotics and beyond. It will entail energy and time savings for the economy, and the use of sample-efficient and transparent models will democratize technology and build trust.
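
The idea of a lifting function mentioned in the LiftMeUp summary can be illustrated with a classical textbook-style example from Koopman theory, in which a nonlinear system becomes exactly linear once a well-chosen extra coordinate is added. The system, its coefficients, and the hand-picked lifting below are assumptions made for the sake of the example; in LiftMeUp the lifting functions are to be inferred from data.

```python
def nonlinear_step(x1, x2, a=0.9, b=0.5, c=1.0):
    """Original dynamics: the x1**2 term makes the update nonlinear in the state."""
    return a * x1, b * x2 + c * x1 ** 2


def lift(x1, x2):
    """Hand-picked lifting: augment the state with the monomial x1**2."""
    return (x1, x2, x1 ** 2)


def lifted_linear_step(z, a=0.9, b=0.5, c=1.0):
    """The same dynamics in lifted coordinates: every update is linear in z."""
    z1, z2, z3 = z
    return (a * z1, b * z2 + c * z3, a * a * z3)


x = (1.0, -2.0)
z = lift(*x)
for _ in range(10):
    x = nonlinear_step(*x)
    z = lifted_linear_step(z)
print("nonlinear rollout, lifted:", lift(*x))
print("linear rollout in lifted space:", z)
```

Both rollouts agree up to floating-point error, which is what makes linear (and hence globally solvable) estimation and control tools applicable in the lifted space.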

10.3 National initiatives

10.3.1‌ PRAIRIE

Participants: Justin Carpentier‌​‌, Jean Ponce,​​ Cordelia Schmid.

The​​​‌ Prairie Institute (PaRis AI‌ Research InstitutE) is one‌​‌ of the four French​​ Institutes for Interdisciplinary Artificial​​​‌ Intelligence Research (3IA), which‌ were created as part‌​‌ of the national French​​ initiative on AI announced​​​‌ by President Emmanuel Macron‌ on May 29, 2018.‌​‌ It brings together five​​​‌ academic partners (CNRS, Inria,​ Institut Pasteur, PSL University,​‌ and University of Paris)​​ as well as 17​​​‌ industrial partners, large corporations​ which are major players​‌ in AI at the​​ French, European and international​​​‌ levels, as well as​ 45 Chair holders, including​‌ three of the members​​ of WILLOW (Carpentier, Ponce,​​​‌ Schmid). Ponce is the​ scientific director of PRAIRIE.​‌

10.3.2 PR[AI]RIE-PSAI

Participants: Justin​​ Carpentier, Stephane Caron​​​‌, Shizhe Chen,​ Jean Ponce, Cordelia​‌ Schmid.

PR[AI]RIE-PSAI (Paris School of AI) is one of the 9 French AI-Clusters financed by France 2030, for a total of €75M over 5 years (2024-2029). Created in 2019 by CNRS, Inria, Institut Pasteur, PSL University, Université Paris Cité, and a club of industrial partners, the PaRis Artificial Intelligence Research InstitutE (PR[AI]RIE) was one of the 4 Interdisciplinary Institutes for AI research set up as part of the national strategy for Artificial Intelligence announced by the President of the French Republic in March 2018. Starting in 2024, it has evolved to become the PR[AI]RIE Paris School of AI (PR[AI]RIE-PSAI) and to cover the triptych of research, training and innovation. PR[AI]RIE-PSAI's activities are supported by 125 internationally renowned scientists, specialists in AI with diverse fields of application, such as biology, health, physics, transport or the environment, working in collaboration with public and private actors in these sectors. It includes the five faculty members of WILLOW.

10.3.3 VideoPredict:​​​‌ Predicting future video content​

Participants: Cordelia Schmid,​‌ Jean Ponce.

Predicting future video content is a challenging problem with high potential impact in downstream tasks such as self-driving cars and robotics, but also much promise for the learning process itself, from self-supervised learning to data augmentation. Existing approaches range from predicting future actions with semantic labels to creating realistic renderings of future frames. Most of them use straight predictions from convolutional features of previous frames. We propose instead to model the causal effects involved in the video formation process, and to disentangle motion and appearance factors. This will result in better prediction but also, and perhaps more importantly, in a better, more structured understanding of the video content, leading to explainable and interpretable results, and eventually to more trustworthy learning systems. The German and French partners are, respectively, experts in machine learning and computer vision, with complementary research threads in causality and disentangled data models on the one hand, and video understanding and action recognition on the other hand, which are ideally suited for this collaborative project.

10.3.4 PEPR Organic​ Robotics

Participants: Justin Carpentier​‌, Stephane Caron,​​ Megane Millan, Umit​​​‌ Bora Gokbakan, Etienne​ Menager.

The PEPR O2R "Organic Robotics" aims to initiate a change in robotics to create a new generation of robots that are capable of fluid and natural interactions with users and of social adaptation in their interactions, and that accompany the technological transitions of societies by providing adapted, responsive and reliable services to citizens. Within this national program, WILLOW is involved in Structuring Action 2 (Robot motion with physical interactions and social adaptation) led by Philippe Souères at LAAS-CNRS, and Structuring Action 4 (Modelling, Simulation, Multi-scale, and Biomechanics) led by Jérémie Dequidt at Inria DEFROST. J. Carpentier is also a member of the executive committee of the PEPR.

10.3.5 ARTIFACT (ANR​​​‌ Tremplin): La fabrique du‌ mouvement artificiel

Participants: Justin‌​‌ Carpentier, Megane Millan​​, Franki Nguimatsia Tiofack​​​‌.

Modern robots remain confined to tightly controlled environments, and even the complex choreographies that Boston Dynamics humanoids execute flawlessly depend heavily on motion capture, hand-crafted control strategies, and detailed models of the workspace, which leave little room for perception. Put plainly, robots are far from reaching the level of agility and dexterity, let alone the autonomy, robustness and safety, required for their deployment "in the wild" alongside humans. A leap forward in these capabilities is needed for them to deliver on their promise and truly leave the laboratory. Our premise is that the key to this revolution is the development of the theoretical and algorithmic foundations of a true artificial intelligence of motion, an AI that must meet the additional challenge of physically interacting with dynamically changing environments and, ultimately, with people. We will break with the dichotomy between optimal control, where the role of perception is traditionally limited to an early state-estimation step, and reinforcement learning, where control policies are generally learned without a model and with no guarantee of coping with the curse of dimensionality. Concretely, we will use the Koopman formalism for complex dynamical systems to learn sensorimotor models and the corresponding control strategies from sensor data.
We will develop powerful methods to learn, control and share a dictionary of sensorimotor synergies across tasks, echoing those used by the human central nervous system in everyday tasks and speeding up the acquisition of new skills. We will leverage the composition of sensorimotor strategies and tree-search strategies powered by neural networks to optimally plan robot motions under dynamic observation constraints. The proposed framework will be implemented in new differentiable-programming software architectures and demonstrated on several locomotion and manipulation tasks, both in simulation and on real robots.

10.3.6 NIMBLE‌​‌ (ANR JCJC): Inexact optimization​​ for robot control

Participants:​​​‌ Justin Carpentier, Stephane‌ Caron, Etienne Arlaud‌​‌, Jean Ponce,​​ Oumayma Bounou, Joris​​​‌ Vaillant, Wilson Jallet‌, Frederike Dumbgen.‌​‌

The limited agility and dexterity of modern robots prevent them from being deployed outside of laboratories, let alone outside of factories. With NIMBLE, we identify the classical sense-plan-act design pattern, widely adopted in robotics, as one of the main limiting factors. We propose to replace this three-part control paradigm by learning, from real robot experiments, a predictive model of the robot's sensorimotor capabilities. This sensorimotor model will notably be exploited to make complex decisions that generalize to unforeseen situations directly from sensor measurements. While NIMBLE’s innovation takes its roots in the observation of human motor control organization, it is grounded in advanced and principled mathematical methodologies, in particular the Koopman operator sitting on top of (deep) learning, and exploits our recognized expertise in robot modeling, optimal control and machine learning for real robots. It will notably enable complex tasks to be defined and executed directly in the sensor space. The success of NIMBLE will be assessed through clear benchmarks in quadrupedal locomotion able to optimally adapt to unstructured terrains and in mobile manipulation for opening unknown doors using a sound combination of force and visual feedback.

10.3.7 INEXACTE (ANR​​ PRCE): Inexact optimization for​​​‌ robot control

Participants: Justin​ Carpentier, Stephane Caron​‌, Etienne Arlaud,​​ Antoine Bambade, Joris​​​‌ Vaillant, Wilson Jallet​.

Robotic systems are​‌ expected to take a​​ large place in tomorrow’s​​​‌ society, far beyond current​ industrial robots in tightly​‌ controlled factory environments, with​​ large impacts in terms​​​‌ of safety, health at​ work, comfort and productivity.​‌ The motion of robots​​ is typically designed and​​​‌ controlled by specifying numerical​ objectives and constraints on​‌ what they must do,​​ and within which limits.​​​‌ These specifications often conflict,​ and the actual control​‌ must be computed to​​ satisfy all of them​​​‌ in the best possible​ way. This is naturally​‌ achieved by solving a​​ numerical optimization problem. Such​​​‌ problems are often small​ enough in robotics that​‌ they can be solved​​ exactly in theory, but​​​‌ they are always based​ on models, and by​‌ definition, models reflect reality​​ imperfectly, even more so​​​‌ as we get away​ from tightly controlled (factory)​‌ environments.

We propose a complete change of paradigm, acknowledging that we actually solve inaccurate optimization problems that provide inaccurate solutions by construction, and explore the following two hypotheses: (H1) we can obtain the exact same performance with imprecise numerical solutions; (H2) we can obtain these imprecise numerical solutions using less costly numerical methods, which can be computed faster, using less demanding hardware. To the best of our knowledge, these questions have barely been explored, and INEXACTE will provide the first comprehensive exploration of this topic.

Our short-term​​​‌ ambition is to significantly​ lower the computational requirements​‌ for solving control problems,​​ taking advantage of the​​​‌ imprecisions inherent to robotics​ control to compute appropriate​‌ solutions faster. But ultimately,​​ our long-term ambition is​​​‌ to design less fragile,​ less expensive and less​‌ polluting robots, since being​​ less dependent on precise​​​‌ models can make us​ less dependent on precise​‌ and therefore complex, fragile,​​ expensive and resource-demanding mechatronics.​​​‌

10.3.8 3D-GEM (ANR JCJC):​ Learning Generalizable 3D-based Robotic​‌ Manipulation Policies

Participants: Shizhe​​ Chen, Justin Carpentier​​, Stephane Caron,​​​‌ Cordelia Schmid, Jean‌ Ponce.

Robotic manipulation in unstructured environments is a long-term goal, with the potential for significant societal and economic impacts, for example in manufacturing and healthcare. However, current approaches suffer from significant limitations in generalization to novel environments, objects and tasks, which is essential for real-world applications. Most learning-based methods are trained and evaluated on a narrow range of tasks, typically basic pick-and-place skills, and focus on 2D images, lacking crucial 3D understanding. The 3D-GEM project aims to develop cutting-edge robotic manipulation systems by leveraging recent breakthroughs in artificial intelligence, particularly large language models and vision foundation models, to build 3D-based robotic manipulation foundation models. This initiative will establish a modular framework to tackle critical challenges, including data scarcity, generalization, dexterity, and efficiency. The project involves three key thrusts: (1) significantly enhancing the scale and quality of robot datasets; (2) advancing 3D embodied perception and task planning for comprehending complex 3D scenes and generating high-level grounded plans; (3) learning generalist 3D motion planning policies using multimodal sensors and model predictive control. These high-level and low-level modules will function in a closed-loop system to enable efficient task execution across diverse scenarios, ultimately improving the versatility and effectiveness of robotic systems.

10.4 Regional initiatives

10.4.1​​​‌ AI4IDF

Participants: Justin Carpentier‌, Jean Ponce,‌​‌ Etienne Arlaud, Joris​​ Vaillant, Pierre-Guillaume Raverdy​​​‌.

Île-de-France is home‌ to the world's largest‌​‌ mathematics community, several of​​ France's largest computer science​​​‌ laboratories, but also a‌ dense industrial fabric in‌​‌ artificial intelligence.

In this​​ extremely rich context, the​​​‌ four main Artificial Intelligence‌ (AI) institutes - DATAIA,‌​‌ Hi! PARIS, PRAIRIE and​​ SCAI - propose to​​​‌ create an alliance to‌ structure and animate the‌​‌ community, and to offer​​ industrial and international partners​​​‌ a unified vision of‌ the exceptional forces at‌​‌ work.

11 Dissemination

Participants:​​ Ajay Sathya, Cordelia​​​‌ Schmid, Elliot Vincent‌, Etienne Menager,‌​‌ Etienne Arlaud, Francois​​ Garderes, Fabian Schramm​​​‌, Frederike Dumbgen,‌ Franki Nguimatsia Tiofack,‌​‌ Gabriel Fiastre, Louis​​ Montaut, Justin Carpentier​​​‌, Jean Ponce,‌ Shizhe Chen, Shiyao‌​‌ Li, Sara Pieri​​, Theotime Le Hellard​​​‌, Matthieu Futeral-Peter,‌ Paul Pacaud, Roland‌​‌ Andrews, Thomas Chabal​​, Valentin Tordjman–Levavasseur,​​​‌ Wilson Jallet, Umit‌ Bora Gokbakan, Zeeshan‌​‌ Khan.

11.1 Promoting​​ scientific activities

11.1.1 Scientific​​​‌ events: organisation

Member of‌ the organizing committees
  • Ellis Workshop on Comp. Vision & Machine Learning, Bad Teinach, April 2025 (C. Schmid)
  • CVPR 2025 Generalization in Robotics Manipulation Workshop and Challenges, Nashville, June 2025 (S. Chen, C. Schmid)
  • Video AI‌​‌ Symposium, Paris, September 2025​​ (C. Schmid)
  • CoRL 2025​​​‌ workshop on Open-Source Hardware‌ in the Era of‌​‌ Robot Learning, Seoul, September​​ 2025 (S. Caron)
  • Workshop​​​‌ on Diverse Optimization and‌ Exploration, Paris, November 2025‌​‌ (J. Carpentier)

11.1.2 Scientific​​ events: selection

Chair of​​​‌ conference program committees
  • IEEE/CVF‌ Computer Vision and Pattern‌​‌ Recognition Conference (CVPR) (J.​​​‌ Ponce, C. Schmid, S.​ Chen)
  • International Conference on​‌ Computer Vision (ICCV) (J.​​ Ponce, C. Schmid)
  • European​​​‌ Conference on Computer Vision​ (ECCV) (C. Schmid, S.​‌ Chen)
  • International Conference on​​ Machine Learning (ICML) (C.​​​‌ Schmid, S. Chen)
  • International​ Conference on Learning Representations​‌ (ICLR) (C. Schmid, S.​​ Chen)
  • Association for the​​​‌ Advancement of Artificial Intelligence​ (AAAI) (S. Chen)
  • RSS​‌ Pioneers (A. Sathya, F.​​ Dumbgen)
Member of the​​​‌ conference program committees
  • Associate​ editor for the Humanoids​‌ 2025 conference (S. Caron,​​ A. Sathya)
  • Associate editor​​​‌ for the IROS 2025​ conference (A. Sathya)
  • Associate​‌ editor for the ICRA​​ 2026 conference (A. Sathya)​​​‌
Reviewer
  • IEEE-RAS International Conference on Robotics and Automation (ICRA) (S. Caron, Ü. B. Gökbakan, F. N. Tiofack, T. Le Hellard, F. Schramm, A. Sathya, S. Chen, T. Chabal)
  • IEEE-RAS International Conference on​​​‌ Humanoid Robots (Humanoids) (S.​ Caron)
  • IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (S. Caron, Ü. B. Gökbakan, F. Schramm, S. Li, L. Montaut, S. Chen)
  • International Conference on Learning Representations (ICLR) (T. Le Hellard, S. Pieri)
  • Annual Meeting​‌ of the Association for​​ Computational Linguistics (ACL) (S.​​​‌ Pieri)
  • Conference on Neural​ Information Processing Systems (NeurIPS)​‌ (S. Pieri, Z. Khan)​​
  • IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (S. Pieri, S. Li, Z. Khan)
  • International Conference on 3D​​​‌ Vision (3DV) (S. Li)​
  • International Conference on Machine​‌ Learning (ICML) (Z. Khan)​​
  • Robotics: Science and Systems​​​‌ (RSS) (L. Montaut)
  • IEEE​ International Conference on Automation​‌ Science and Engineering (CASE)​​ (A. Sathya)
  • International Conference​​​‌ on Computational Linguistics (COLING)​ (S. Chen)
  • Conference on​‌ Language Modeling (COLM) (S.​​ Chen)
  • Conference on Robot​​​‌ Learning (CoRL) (S. Chen)​
  • International Conference on Computer​‌ Vision (ICCV) (S. Chen,​​ T. Chabal)

11.1.3 Journal​​​‌

Member of the editorial​ boards
  • Associate editor, IEEE​‌ Transactions on Pattern Analysis​​ and Machine Intelligence (TPAMI)​​​‌ (S. Chen)
  • Associate editor, IEEE Transactions on Robotics (T-RO) (J. Carpentier)
  • Associate editor, IEEE Robotics and Automation Letters (RA-L) (J. Carpentier)
Reviewer - reviewing​‌ activities
  • IEEE Transactions on​​ Robotics (T-RO) (J. Carpentier,​​​‌ S. Caron, L. Montaut,​ A. Sathya, S. Chen)​‌
  • IEEE Robotics and Automation​​ Letters (RA-L) (S. Caron,​​​‌ Ü. B. Gökbakan, F.​ Schramm, E. Menager, L.​‌ Montaut, A. Sathya, S.​​ Chen, T. Chabal)
  • International​​​‌ Journal of Robotics Research​ (IJRR) (E. Menager, A.​‌ Sathya)
  • IEEE-RAS International Conference​​ on Soft Robotics (E.​​​‌ Menager)
  • IEEE Transactions on​ Automation Science and Engineering​‌ (T-ASE) (A. Sathya)
  • International​​ Journal of Computer Vision​​​‌ (IJCV) (S. Chen)

11.1.4​ Invited talks

  • J. Carpentier, Inria Scientific Council, Paris, December 2025.
  • J. Carpentier,​​​‌ Workshop on Diverse Optimization​ for Robotics, Paris, November​‌ 2025.
  • J. Carpentier, 4th​​ International Workshop on AI​​​‌ for Robotics, NAVER LABS​ Europe, Grenoble, November 2025.​‌
  • J. Carpentier, X-IA event,​​ Paris, November 2025.
  • J. Carpentier, Round table on AI and Robotics, Cap Digital, Paris, November 2025.
  • J. Carpentier, IEEE-RAS​​​‌ Polish Chapter, Remote, September​ 2025.
  • J. Carpentier, PAISS​‌ Summer School, Grenoble, September​​ 2025.
  • J. Carpentier, Round table on Robotics and AI, VivaTech, Paris, June 2025.
  • J. Carpentier, RobotSoft workshop, Lausanne, April 2025.
  • J. Carpentier, AI Summit, February 2025.
  • C. Schmid, 4th‌ International Workshop on AI‌​‌ for Robotics, NAVER LABS​​ Europe, Grenoble, November 2025.​​​‌
  • C. Schmid, Workshop on‌ Multimodal Representation and Retrieval,‌​‌ in conjunction with ICCV’25,​​ October 2025.
  • C. Schmid,​​​‌ Festvortrag (ceremonial lecture) at‌ Leopoldina annual meeting, Halle,‌​‌ September 2025.
  • C. Schmid,​​ Invited speaker at Video​​​‌ AI Symposium, Paris, September‌ 2025.
  • C. Schmid, Keynote‌​‌ at Building Bridge Conference,​​ Dresden, 2025.
  • C. Schmid,​​​‌ Invited speaker at Workshop‌ on Pixel-level Vision Foundation‌​‌ Models, in conjunction with​​ CVPR’25, June 2025.
  • C.​​​‌ Schmid, Invited speaker at‌ Workshop on Multimodal Algorithmic‌​‌ Reasoning, in conjunction with​​ CVPR’25, June 2025.
  • C.​​​‌ Schmid, Invited speaker at‌ Workshop on ScanNet++ Novel‌​‌ View Synthesis and 3D​​ Semantic Understanding, in conjunction​​​‌ with CVPR’25, June 2025.‌
  • C. Schmid, Invited speaker‌​‌ at Workshop on Video​​ Large Language Models, in​​​‌ conjunction with CVPR’25, June‌ 2025.
  • C. Schmid, Invited‌​‌ speaker at Workshop on​​ Computer Vision in the​​​‌ Wild, in conjunction with‌ CVPR’25, June 2025.
  • C.‌​‌ Schmid, Seminar at Applied​​ and Theoretical Aspects of​​​‌ Robot Intelligence Lab, TUM,‌ Munich, December 2025.
  • C.‌​‌ Schmid, Presentation at Workshop​​ on Diverse Optimization and​​​‌ Exploration, Inria, Paris, November‌ 2025.
  • C. Schmid, Presentation‌​‌ at Malik Fest, University​​ of California, Berkeley, October​​​‌ 2025.
  • C. Schmid, Presentation‌ at Ellis workshop, Bad‌​‌ Teinach, April 2025.
  • J. Ponce, Forum on AI Frontiers 2025, Seoul, Korea, October 27, 2025.
  • S. Caron,‌​‌ CoRL 2025 workshop on​​ Open-Source Hardware in the​​​‌ Era of Robot Learning,‌ September 2025.
  • S. Caron,‌​‌ Harada Lab, The University​​ of Osaka, Japan, October​​​‌ 2025.
  • S. Caron, LAAS-CNRS,‌ October 2025.
  • A. Sathya,‌​‌ Robotics, Optimization, and Assistive​​ Mobility (ROAM) Lab at​​​‌ University of Notre Dame,‌ USA, July 2025.
  • A.‌​‌ Sathya, EEE Department, IIT-Guwahati,​​ Guwahati, India, August 2025.​​​‌
  • A. Sathya, Gepetto Team,‌ LAAS-CNRS, Toulouse, October 2025.‌​‌
  • S. Chen, GDR IASIS​​ workshop on Deformable Object​​​‌ Modelling Trends: from Perception‌ to Applications, Paris, April‌​‌ 2025.
  • S. Chen, Invited​​ speaker at Workshop on​​​‌ Computer Vision in the‌ Wild, in conjunction with‌​‌ CVPR’25, June 2025.
  • S.​​ Chen, Demi-heure de science,​​​‌ Inria Paris, July 2025.‌
  • S. Chen, Korea University,‌​‌ October 2025.
  • S. Chen,​​ THOTH team, Inria Grenoble,​​​‌ November 2025.
  • S. Chen,‌ Workshop on Foundation Models‌​‌ in Robotics, Lyon, November​​ 2025.

11.2 Teaching -​​​‌ Supervision - Juries -‌ Educational and pedagogical outreach‌​‌

11.2.1 Teaching

  • Course at​​ Master MVA: Robotics (S.​​​‌ Caron, J. Carpentier, A.‌ Sathya as teaching assistant)‌​‌
  • Course at ENS-PSL: Planification​​ de mouvement en robotique​​​‌ et en animation graphique‌ (S. Caron)
  • Course at‌​‌ Dauphine-PSL: Formation Chef de​​ projet IA (S. Caron)​​​‌
  • Lecture at Mines Paris-PSL‌ (formation MAREVA): Reinforcement learning‌​‌ for robotics (S. Caron)​​
  • Introduction to computer vision,​​​‌ NYU, Spring 2025 (J.‌ Ponce)
  • Introduction to computer‌​‌ vision, ENS-PSL, Fall 2025​​ (J. Ponce, G. Fiastre​​​‌ as teaching assistant)
  • Computer‌ vision, Chefs de projets‌​‌ IA, Société des Ingénieurs​​ de l'Automobile, June 2025​​​‌ (J. Ponce)
  • Computer vision,‌ Casino COMEX, We Are‌​‌ School, May 2025 (J.​​ Ponce)
  • Machine Learning and​​​‌ Applications (MALAP), École Nationale‌ des Ponts et Chaussées,‌​‌ IP Paris (S. Li​​​‌ as teaching assistant)
  • Deep​ Reinforcement Learning lecture at​‌ Master MVA: Deep Learning​​ (S. Chen)

11.2.2 Supervision​​​‌

PhD defenses
  • Zerui Chen,​ advised by C. Schmid​‌ and S. Chen.
  • Ricardo​​ Garcia-Pinel, advised by C.​​​‌ Schmid and S. Chen.​
  • Thomas Chabal, advised by​‌ J. Ponce, C. Schmid​​ and S. Chen.
  • Matthieu​​​‌ Futeral-Peter, advised by R.​ Bawden (Inria ALMAnaCH), B.​‌ Sagot (Inria ALMAnaCH) and​​ C. Schmid.
  • Theo Bodrito,​​​‌ advised by J. Ponce​ and J. Mairal (Inria​‌ Grenoble).
  • Elliot Vincent, advised​​ by J. Ponce and​​​‌ M. Aubry (ENPC).
PhD​ students
  • Théotime Le Hellard,​‌ started in Oct 2025,​​ advised by J. Carpentier.​​​‌
  • Franki Nguimatsia Tiofack, started​ in Jan 2025, advised​‌ by J. Carpentier.
  • Roland​​ Andrews, started in Oct​​​‌ 2024, advised by J.​ Carpentier and A. Taylor​‌ (Sierra).
  • Imen Mahdi (University​​ of Freiburg), started in​​​‌ Oct 2024, advised by​ C. Schmid and Abhinav​‌ Valada (University of Freiburg).​​
  • Romain Seailles, started in​​​‌ Sept 2024, advised by​ J. Ponce and J.​‌ Mairal (Inria Grenoble).
  • Basile​​ Terver (Meta), started in​​​‌ Nov 2024, advised by​ J. Ponce and Y.​‌ LeCun (Meta).
  • Shiyao Li (ENPC), started in Oct 2024, advised by S. Chen and V. Lepetit (ENPC).
  • Lucas Ventura, started​​ in Oct 2022, advised​​​‌ by C. Schmid and​ G. Varol (ENPC).
  • Fabian​‌ Schramm, started in Feb​​ 2023, advised by J.​​​‌ Carpentier and N. Perrin-Gilbert​ (ISIR).
  • Zeeshan Khan, started​‌ in Sept 2023, advised​​ by S. Chen and​​​‌ C. Schmid.
  • Gabriel Fiastre,​ started in Sept 2023,​‌ advised by C. Schmid.​​
  • Ludovic de Matteis, started​​​‌ in Oct 2023, advised​ by J. Carpentier and​‌ N. Mansard (CNRS).
  • Ümit Bora Gökbakan, started in June 2024, advised by S. Caron and P. Souères (CNRS).
  • Sara Pieri, started in Oct 2024, advised by S. Chen, C. Schmid and J. Sivic (Czech Technical University).
  • Paul Pacaud, started in Oct 2024, advised by S. Chen and C. Schmid.
  • François Gardères, started in May 2023, advised by S. Chen and J. Ponce.
  • F. Porcher,​‌ started in May 2025,​​ advised by N. Carion​​​‌ (Meta), K. Alahari (Inria​ Grenoble) and S. Chen.​‌

11.2.3 Juries

  • PhD committee​​ of Marc Duclusaud, University​​​‌ of Bordeaux, France, December​ 2025 (S. Caron)
  • CRCN​‌ recruitment jury for Inria​​ Nancy, April 2025 (S.​​​‌ Caron)
  • Timothée Darcet, Université​ Grenoble Alpes, July 2025​‌ (C. Schmid)
  • Théo Cachet, Sorbonne Université, Paris, June 2025 (C. Schmid)
  • Kumar​ Ashutosh, prelim UT Austin,​‌ May 2025 (C. Schmid)​​
  • Corentin Sautier, ENPC, October​​​‌ 2025 (S. Chen)
  • Ivan​ Lopes, Mines Paris –​‌ PSL, October 2025 (S.​​ Chen)
  • Smail Ait Bouhsain,​​​‌ LAAS-CNRS, April 2025 (J.​ Carpentier)

11.3 Popularization

11.3.1​‌ Specific official responsibilities in​​ science outreach structures

  • "Carte​​​‌ blanche on AI" in​ Le Monde newspaper with​‌ Isabelle Ryl, 6 times​​ a year (J. Ponce)​​​‌
  • Interview by Micode on Underscore_, available on YouTube, with over 310,000 views (J. Ponce)

11.3.2​​​‌ Productions (articles, videos, podcasts,​ serious games, ...)

  • Popular science video for the Algorea platform (reach: approximately 10k teachers and 1M students) in collaboration with France-IO (S. Caron)

12 Scientific production​​

12.1 Major publications

12.2 Publications​​​‌ of the year

International‌ journals

National journals

  • [17] Armand Jordana, Sébastien Kleff, Arthur Haffemayer, Joaquim Ortiz-Haro, Justin Carpentier, Nicolas Mansard and Ludovic Righetti. Infinite-Horizon Value Function Approximation for Model Predictive Control. IEEE Robotics and Automation Letters, June 2025.

International peer-reviewed conferences​​

Conferences without proceedings

Edition (books,​ proceedings, special issue of​‌ a journal)

  • [36] Optimal transport unlocks end-to-end learning for single-molecule localization. ICLR 2026: The Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 2026. In press.

Doctoral dissertations​​ and habilitation theses

  • [37] Théo Bodrito. Deep learning for exoplanet detection in high contrast imaging. Thesis, Ecole Normale Supérieure - PSL, June 2025.
  • [38] Thomas Chabal. Object-centric representations for sensing and planning in visually-guided robotics. Thesis, Ecole Normale Supérieure, September 2025.
  • [39] Zerui Chen. Learning dexterous manipulation from 3D hand and object interaction. Thesis, Université Paris sciences et lettres, May 2025.
  • [40] Ricardo Garcia-Pinel. Learning Visuomotor Policies for Robotic Manipulation. Thesis, Ecole Normale Supérieure - PSL, June 2025.
  • [41] Elliot Vincent. Analysis of satellite image time series for classification and change detection. Thesis, École des Ponts ParisTech, May 2025.

Reports &​ preprints

Other scientific publications