2025 Activity report: Project-Team STARS
RNSR: 201221015V
- Research center: Inria Centre at Université Côte d'Azur
- Team name: Spatio-Temporal Activity Recognition of Social interactions
Creation of the Project-Team: 2024 November 01
Keywords
Computer Science and Digital Science
- A9. Artificial intelligence
- A9.1. Knowledge
- A9.2. Machine learning
- A9.3. Signal processing
- A9.8. Reasoning
- A9.12. Computer vision
Other Research Topics and Application Domains
- B2.1. Well being
- B2.4. Therapies
1 Team members, visitors, external collaborators
Research Scientists
- François Brémond [Team leader, INRIA, Senior Researcher, HDR]
- Michal Balazia [UNIV COTE D'AZUR, ISFP]
- Antitza Dantcheva [INRIA, Senior Researcher, HDR]
- Monique Thonnat [INRIA, Emeritus, HDR]
Post-Doctoral Fellows
- Baptiste Chopin [INRIA, Post-Doctoral Fellow, until Aug 2025]
- Olivier Huynh [INRIA, Post-Doctoral Fellow, until Feb 2025]
PhD Students
- Tanay Agrawal [INRIA, until Oct 2025]
- Yuan Gao [INRAE, until Dec 2026]
- Snehashis Majhi [INRIA, until Mar 2026]
- Nabyl Quignon [INRIA, from Mar 2025]
- Aglind Reka [UNIV COTE D'AZUR, from Sep 2025]
- Tomasz Stanczyk [INRIA, until May 2026]
- Valeriya Strizhkova [INRIA, until Feb 2025]
- Charbel Yahchouchi [Probayes, CIFRE, from Jul 2025]
- Seongro Yoon [INRIA, until Oct 2027]
Technical Staff
- Mahmoud Ali [INRIA, Engineer]
- Marios Kaplanis [INRIA, Engineer, from Apr 2025 until Jun 2025]
- Aowen Shi [INRIA, Engineer, from Feb 2025]
- Yoann Torrado [INRIA, Engineer, until May 2025]
Interns and Apprentices
- Hardik Agarwal [INRIA, Intern, from Apr 2025 until Jul 2025]
- Akshaya Ananda Murthy [INRIA, Intern, from Apr 2025 until Aug 2025]
- Andranik Arakelov [INRIA, Intern, from Jun 2025 until Oct 2025]
- Saurabh Atreya [INRIA, Intern, until May 2025]
- Aaryan Dhawan [UNIV COTE D'AZUR, Intern, from Apr 2025 until Sep 2025]
- Anil Egin [INRIA, Intern, until Apr 2025]
- Utkarsh Gupta [INRIA, Intern, from Apr 2025 until Jul 2025]
- Khodor Hamadi [INRIA, Intern, from Jun 2025 until Nov 2025]
- Mingyun Jeong [INRIA, Intern, from Jul 2025]
- Jang Hyun Kim [INRIA, Intern, from Jul 2025 until Sep 2025]
- Dian-Wei Lai [INRIA, Intern, from Oct 2025]
- Quentin Merilleau [INRIA, from Feb 2025]
- Nishit Poddar [INRIA, Intern, from Apr 2025 until Jun 2025]
- Nabyl Quignon [INRIA, until Feb 2025]
- Aglind Reka [INRIA, until Aug 2025]
- Miriana Russo [INRIA, Intern, from Oct 2025]
- Ananya Sharma [INRIA, from Sep 2025]
- Utkarsh Tiwari [Inria, until Apr 2025]
Administrative Assistant
- Marie-Cecile Lafont [INRIA]
Visiting Scientists
- Seungryul Baek [Ulsan National Institute of Science and Technology (UNIST), Republic of Korea, from Jul 2025 until Aug 2025]
- Donghyeon Cho [Hanyang University, Seoul, Republic of Korea, from Dec 2025]
- Nesli Ergdogmus [Izmir Institute of Technology, Turkey, from Jun 2025 until Jul 2025]
- Salvatore Fiorilla [University of Bologna, Italy, until Feb 2025]
- Eric Granger [ETS MONTREAL, from Dec 2025]
- Jinsun Park [Pusan National University, Busan, Republic of Korea, from Jul 2025 until Jul 2025]
- Teimuraz Saginadze [MICM, Georgian Technical University, GTU, Tbilisi, Georgia, from Feb 2025 until Apr 2025]
External Collaborators
- Abid Ali [INRIA, then University of Luxembourg, until Jul 2025]
- Laura Ferrari [Scuola Superiore Sant'Anna, Pisa, Italy]
- Rachid Guerchouche [CoBTek]
- Alexandra Konig [CoBTek, CHU NICE]
- Benoit Lagadec [FairVision, from Jul 2025]
- Hali Lindsay [KIT - ALLEMAGNE, until Jun 2025]
- Sabine Moisan [retired, HDR]
- Jean Rigault [retired]
- Philippe Robert [CoBTek]
- Yaohui Wang [Shanghai AI Lab]
- Di Yang [Shanghai University, until Aug 2025]
2 Overall objectives
2.1 Presentation
The STARS research project-team focuses on the design of computer vision methods for real-time understanding of social interactions observed by sensors. Our objective is to propose new algorithms to analyze the behavior of people suffering from behavioral disorders, in order to improve their quality of life. We study long-term spatio-temporal interactions performed by humans in their natural environment. We address this challenge by proposing novel deep-learning architectures to model behavioral traits such as facial expression, gaze, gestures, body behavior, and body language. To cope with the limited amount of available data and the privacy issues of medical data, we propose data generation for data augmentation and anonymization. Another important challenge is to make the link between collected data, medical diagnosis, and ultimately treatments. To validate our research we work closely with our clinical partners, in particular those of the Nice Hospital.
2.2 Motivation
Deep learning techniques are highly successful for simple action recognition; nevertheless, several important challenges remain in activity recognition in general, and specifically in our target medical application domain.
To validate our research, we work closely with our clinical partners. We have a strategic partnership named CoBTek with the clinicians of Nice Hospital (CHU Nice) to study the impact of video understanding approaches for cognitive disorders. This partnership started in January 2012 and has evolved to a University Côte d'Azur team and joint work with monthly regular meetings between STARS and the clinicians of Institut Claude Pompidou (ICP), Lenval, and Pasteur hospitals. The two directors of CoBTek are François Brémond and Florence Askenazy (PU-PH) at Lenval. Our objective to deepen research in social interaction is motivated by the needs of our clinician partners. A typical use-case of social interactions observed by sensors appears in the clinical assessments of psychiatric patients, such as people suffering from conditions like major depression, bipolar disorder, or schizophrenia 25. In these clinical assessments, interactions between the patient and the clinician are recorded with multi-modalities, i.e., with video, audio, and physiological sensors. The goal is to extract digital markers (defined by formal interaction models), which are indicators that can characterize a digital phenotype. The digital markers are automatically extracted from the recorded data and the digital phenotypes could lead to a treatment improving the patient's behavioral disorders.
Social interaction as a new study target.
An abundance of valuable diagnostic relevant information is extracted from the interaction between clinician and patient. This clinical interaction (e.g., conversation between patient and clinician including verbal and nonverbal communication) is traditionally the clinician’s most important source of information about patients’ social skills, mood, and motivation levels. However, a comprehensive clinical interview requires sufficient consultation time as well as strong clinical competencies and expertise to be able to detect early subtle signs of changes in communication. Moreover, for detecting these changes during a clinical conversation, no standardized objective measures exist, leaving a lot of room for speculation and subjective biases. Introducing methodologies to assess in a quantitative manner behavioral dynamics during real-life social interaction could help indicate, for instance, the level of reciprocity and therapeutic alliance, which until now is merely left to clinical intuition as we have pointed out 25.
Need for precise and sensitive digital markers.
To develop and test new measures of mental illness, a movement from traditional markers and phenotyping to digital markers and digital phenotyping is needed. "Digital phenotyping" refers to the moment-to-moment quantification of human behavior in everyday life using data from digital devices. Digital phenotyping suggests collecting patient data allowing for non-intrusive and continuous monitoring of behavioral and mental states, ultimately revealing clinically relevant information. Similarly, "digital markers" (e.g., frequency of eye contact) are digitally obtained disease indicators that can be used to define a digital phenotype (e.g., eye gaze). Interaction-based phenotyping could provide various additional data to generate an observer-independent assessment of behavior during a social interaction, which mirrors the current symptomatology of a patient. Additionally, interaction-based measures such as social synchrony may have predictive value for treatment outcomes. Recent progress in computer vision, speech processing, and machine learning has enabled detailed and objective characterization of human interaction behavior 8. Applying these advanced methods of artificial intelligence provides new opportunities to identify digital markers of patient behavior. Such markers have the potential to provide objective and continuous assessments of symptomatology in the context of patients' daily lives 30, 4, thereby allowing treatment to be precisely tailored to the concrete patient trajectory. So far, many developed techniques are based solely on verbal information during interviews; however, interpersonal communication often occurs non-verbally. Thus, merging computer vision-based measurement into a multi-modal approach would enhance the quality of analysis by allowing changes in the quality of communication to be detected as alterations in the dyadic interaction patterns.
Digital markers and methods.
In recent years, behavior recognition methods based on artificial intelligence (i.e., machine or deep learning) have become increasingly effective in a variety of tasks, including action classification 19, body language and gestures 6, gaze estimation 26, eye contact detection, facial action units, facial expression 27, as well as affect extracted from single or multiple modalities 2. A growing number of approaches make use of this progress in human behavior sensing to analyze clinical interaction data (e.g., therapy sessions) as well as linguistic and paralinguistic characteristics of speech. As psychiatric disorders (depression, bipolar disorder, schizophrenia) impact the quality of social interactions, there is an emphasis on studying these quantifiable behavioral dynamics in real-life social interaction at the dyadic level rather than solely individual behavior 25. While these initial results are promising, this research needs to be accelerated by further development of digital phenotyping technology focusing on scalability and equity, by establishing shared longitudinal data repositories, and by fostering multidisciplinary collaborations between clinical stakeholders, including patients, computer scientists, and medical researchers.
Sensors for analyzing human interactions.
We plan to keep using mainly RGB (i.e., Red, Green, and Blue) monocular cameras for video analysis. These off-the-shelf sensors are affordable and very precise, with a large dynamic range and high resolution. They are easily deployable in elderly homes and in hospitals. However, depending on the use case, we also investigate other types of sensors (e.g., RGB-D, i.e., RGB colors and Depth, infrared cameras, physiological sensors, and microphones) to capture complementary information. These new sensors can open up new avenues of research. As we do not want to disturb the everyday activities of the end-users, we proceed in three steps. First, we train our models with a large variety of sensors in dedicated locations, such as laboratories. Second, we distill the learned weights into lighter models trained only with RGB video streams. These lighter RGB models are more convenient and less intrusive, as they require only standard RGB cameras. Third, we use only these lighter RGB models at run-time in embedded devices directly at the end-users' locations. Therefore, we only use the sensors and cameras pertinent to the end-users.
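The distillation step described above can be illustrated with a small sketch. The report does not specify the distillation objective, so the code below assumes a common choice: the RGB-only student is trained to match the temperature-softened class probabilities of the multi-sensor teacher, measured by a KL divergence. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures (an assumption, not taken from the report).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (temperature ** 2) * kl


# Hypothetical logits over three activity classes.
teacher = [4.0, 1.0, 0.5]        # teacher trained with several sensors
student_far = [0.5, 1.0, 4.0]    # RGB-only student that disagrees
student_close = [3.8, 1.1, 0.6]  # RGB-only student that nearly agrees

print(distillation_loss(teacher, student_far))
print(distillation_loss(teacher, student_close))
```

In such a setup the loss shrinks as the student's predictions approach the teacher's, so minimizing it transfers what the teacher learned from the extra sensors to a model that only sees RGB at run-time.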
2.3 Social interaction understanding: a challenging task
The major challenge in semantic interpretation of dynamic scenes is to bridge the gap between the task dependent interpretation of data and the flood of measures provided by sensors. The problems we address range from physical object detection, activity understanding, activity learning to vision system design and evaluation. The two principal classes of human activities we focus on are assistance to older adults and video analytics.
Typical examples of complex activity are shown in Figure 1 and Figure 2 for a homecare application (see the Toyota Smarthome dataset). In this example, the monitoring of an older person's apartment could last several months. The activities involve interactions between the observed person and several pieces of equipment. The application goal is to recognize the everyday activities at home through formal activity models and data captured by a network of sensors embedded in the apartment. Here, typical services include an objective assessment of the frailty level of the observed person, in order to provide more personalized care and to monitor the effectiveness of a prescribed therapy. The assessment of the frailty level is performed by an Activity Recognition System, which transmits a textual report (containing only meta-data) to the general practitioner who follows the older person. Thanks to the recognized activities, the quality of life of the observed people can thus be improved and their personal information can be preserved.
[Figure: activity patterns in the dining room, kitchen, and living room (marked C1-C7), with distributions of activity duration (short, medium, long) and temporal variability (low, medium, high), and the frequency of each activity per environment.]
Homecare monitoring: the annotation of a composed activity "Cook", captured by a video camera
The ultimate goal is for cognitive systems to perceive and understand their environment to be able to provide appropriate services to a potential user. An important step is to propose a computational representation of people activities to adapt these services to them. Up to now, the most effective sensors have been video cameras due to the rich information they can provide on the observed environment. These sensors are currently perceived as intrusive ones. A key issue is to capture the pertinent raw data for adapting the services to the people while preserving their privacy. We study different solutions including of course the local processing of the data without transmission of images and the utilization of new compact sensors developed for interaction (also called RGB-Depth sensors, an example being the Kinect) or networks of small non-visual sensors.
2.4 International and Industrial Cooperation
Our work has been applied in the context of more than 10 European projects such as COFRIEND, ADVISOR, SERKET, CARETAKER, VANAHEIM, SUPPORT, DEM@CARE, VICOMO, EIT Health.
We had or have industrial collaborations in several domains: transportation (CCI Airport Toulouse Blagnac, SNCF, Inrets, Alstom, Ratp, Toyota, GTT (Italy)), banking (Crédit Agricole Bank Corporation, Eurotelis and Ciel), security (Thales R&T FR, Thales Security Syst, EADS, Sagem, Bertin, Alcatel, Keeneo), multimedia (Thales Communications), civil engineering (Centre Scientifique et Technique du Bâtiment (CSTB)), computer industry (BULL), software industry (AKKA), hardware industry (ST-Microelectronics) and health industry (Philips, Link Care Services, Vistek).
We have international cooperations with research centers such as Reading University (UK), Idiap (Switzerland), Multitel (Belgium), National Cheng Kung University (Taiwan), National Taiwan University (Taiwan), University of Southern California (USA), University of South Florida (USA), Michigan State University (USA), Chinese Academy of Sciences (China), IIIT Delhi (India), Hochschule Darmstadt (Germany), Fraunhofer Institute for Computer Graphics Research IGD (Germany).
3 Research program
Our research objective is related to the recognition of human actions, facial expressions, and body language in social interactions. Therefore we plan to work on two main research axes:
- Axis 1: Human Interaction Recognition based on body and face analysis,
- Axis 2: Data Generation for Augmentation and Anonymization for solving data limitation and privacy issues.
3.1 Axis 1: Human Interaction Recognition
Participants: François Brémond, Michal Balazia, Antitza Dantcheva, Monique Thonnat.
3.1.1 Body Language Analysis
Participants: François Brémond, Michal Balazia, Antitza Dantcheva, Monique Thonnat.
Body language has been actively researched by psychologists for decades. Early work by Mehrabian found that, among other signals, backward leaning of the torso is indicative of liking. Earlier research has shown that people believe power is expressed with nonverbal cues like open posture (i.e., no arms crossed or legs crossed), more gesturing, and less self-touching (both hands and face). Displacement behaviors such as grooming, face touching, or fumbling are related to anxiety and stress regulation. As a consequence of these manifold connections of body language with important personal and social attributes, body language analysis has been a focus of automatic approaches attempting to infer high-level attributes such as emotion, leadership role, or personality type. In contrast to the human science studies discussed above, these automatic approaches commonly lack an explicit intermediate representation of functional bodily behavior categories. Instead, they rely on a generic feature representation encoding body postures and movements, or on deep learning approaches without a clear interpretable internal structure. While such representations can be effective in prediction scenarios, they often lack interpretability and may miss subtle but meaningful differences, e.g., between fumbling and scratching.
Recognition of Actions and Body Language.
RGB-based human action recognition has often been addressed by three main approaches. Two-stream 2D Convolutional Neural Networks (CNN) generally contain two 2D CNN branches taking different input features extracted from the RGB videos for action recognition. Recurrent Neural Networks (RNN) usually employ 2D CNNs as feature extractors for an LSTM (i.e., Long Short Term Memory) model. 3D CNN based methods extend 2D CNNs to 3D structures, to simultaneously model the spatial and temporal context information in videos that is crucial for action recognition. For instance, one two-stream 2D CNN architecture divides each video into three segments and processes each segment with a two-stream network, fusing the individual classification scores by average pooling to produce the video-level prediction of the action class. Also, the two-stream Inflated 3D CNN (I3D) inflates the convolutional and pooling kernels of a 2D CNN with an additional temporal dimension to process a 3D block of pixels at once. The transformer architecture, originally designed for natural language processing, has recently been extended to computer vision tasks to recognize human activities. In contrast to action recognition, which typically considers freely moving people, the limited work on body language recognition has addressed more constrained social interaction scenarios. We observe that the common denominator of body language analysis methods is the employment of a general action recognition method without handling the specificity of body language, such as subtle motions or micro facial expressions.
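The segment-based scheme mentioned above (split a video into three segments, classify each, then average the scores) can be sketched as follows. This is only a minimal illustration: the per-segment scores are made-up numbers standing in for the output of a two-stream network, and the helper names are hypothetical.

```python
def split_into_segments(num_frames, num_segments=3):
    """Return (start, end) frame ranges of near-equal length, end exclusive."""
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def average_scores(segment_scores):
    """Average the per-class scores over all segments (the pooling step)."""
    count = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [sum(scores[c] for scores in segment_scores) / count
            for c in range(num_classes)]


def predict(segment_scores):
    """Video-level prediction: index of the class with the highest mean score."""
    fused = average_scores(segment_scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical class scores for three segments of one video (3 classes).
scores = [
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]

print(split_into_segments(90))  # frame ranges for a 90-frame video
print(average_scores(scores))   # fused video-level scores
print(predict(scores))          # predicted class index
```

Averaging lets a segment with an ambiguous view (the second one here) be outvoted by the others, which is the motivation for pooling over segments rather than classifying a single clip.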
To summarize, these body language analysis methods enable us to measure objectively the behavior of humans by recognizing their Activities of Daily Living (ADL), their emotions, eating habits, and lifestyle. Human behavior can be modeled by learning from a large number of data, collected from a variety of sensors, to improve and optimize, for instance, the quality of life of people suffering from behavior disorders, such as anxiety or apathy. In previous work, STARS successfully detected the everyday life activities performed by an individual living alone at home and we were able, for instance, to detect breakfast activities, such as “preparing coffee”, and “cutting bread”, with sufficient accuracy 19, 13, 15.
3.1.2 Face Analysis and Emotion Recognition
Participants: François Brémond, Michal Balazia, Antitza Dantcheva.
An emotion is a mental state that arises spontaneously and is often accompanied by cognitive, physical, and physiological changes. Due to the complexity of human reactions, emotion recognition is still limited and remains the target of much scientific research. In fact, Emotion Recognition is a highly multidisciplinary field where psychology meets deep learning. Emotions are typically divided into basic categories, as theorized by Ekman, who identified basic discrete emotions. Such categorization has been extended by considering the interconnection between emotions and multiple intensities.
Predicting emotions has been attempted via facial expression analysis in videos, which has been widely adopted both in research and in industry owing to its ease of use with just a camera. However, the accuracy of computer vision algorithms, such as CNNs, is typically limited in identifying real emotions. Facial micro-expression recognition recently reported state-of-the-art performances when implemented with a transformer-based architecture. While the FaceReader system, launched in late 2005, is used worldwide in institutes and companies, there are still some limitations, such as image quality and facial angulation. Other main open challenges in the field are small available datasets and subjective annotations. Typical datasets range from a few hundred to a few thousand videos, and the annotations are often noisy due to human complexity. A person may be happy even if he/she is not smiling, and people differ widely in how expressive they are in showing their inner emotions. So, emotion annotations are very subjective and need to be adequately addressed. Moreover, emotions have multiple nuances, with different intensities.
Regarding emotional models, various architectures, such as RNNs, LSTMs, and CNNs, have been used with the aim of capturing the spatio-temporal information. In order to improve the recognition accuracy, multimodal transformers have been introduced, exploiting self- and cross-attention. Knowledge distillation from multimodal to unimodal (video) transformers has been reported, to reduce the acquisition complexity at inference time. The state-of-the-art is achieved today with multimodal transformers, using video, audio, and language cues. Here, the video and the audio are processed by small transformer encoders receiving as input features pre-trained on other datasets. The model extracting features is frozen and therefore it cannot be adapted to a new targeted dataset. For the video transformer, the inputs are fixed representations, such as DLN features, IResNet and DenseNet features, Facet/Openface features, R(2+1)D-152 features, and landmarks and action units. Such feature extractors and shallower encoders are typically used when small datasets are targeted. The main limitations of this approach are twofold: first, frozen representations are less appropriate for raw data than end-to-end trainable models; second, smaller models are less accurate for recognizing specific expressions. In order to use raw data and bigger encoders, proper pre-training is needed to limit overfitting. While self-supervised techniques, such as VideoMAE, can be used for that purpose, they may miss the little details necessary to recognize facial micro-expressions. They are therefore not well adapted for the emotion recognition task.
3.1.3 Multimodal Recognition of Human Interactions
Participants: François Brémond, Michal Balazia, Monique Thonnat.
Behavior traits can be detected in self-presentation videos based on acoustic and visual non-verbal features such as pitch, intensity, movement, head orientation, posture, fidgeting, and eye-gaze. According to 1, 2, modalities such as audiovisual, text, and demographic features are important for personality prediction. Emotion recognition has generated specific approaches for multimodal data processing. Deep bimodal models give state-of-the-art results on Multimodal Language Analysis in the Wild. It has been shown that body gestures, head movements, expressions, and speech lead to an effective diagnosis of apathy. Few models have dealt with trimodal fusion of features. Although multimodal approaches are commonly used to recognize personality traits, there does not exist a comprehensive method to optimize and combine the considerable amount of informative features. All modality features may be concatenated together for behavior prediction; this approach is referred to as early fusion. However, most of the multimodal approaches perform late fusion on heterogeneous data, as it outperforms other techniques. Present research in the field aims to find efficient ways for feature extraction and combination. We aim to design new approaches able to utilize all available information in an optimal manner 3. The objective is to develop and test Human Behavior Coding algorithms using RGB video cameras at test time 13, 1, but using multi-modalities at training time, with multiple datasets of various modalities, to better characterize human behavior during interactions. As it is challenging to be an expert in all modalities, we will rely on open-source code (when available) or on our partners (when needed) to obtain the most effective backbone models for extracting multi-modal features.
For instance, we are collaborating with DFKI (i.e., Deutsches Forschungszentrum für Künstliche Intelligenz) 24 to extract audio and text features for measuring neuropsychiatric symptoms in patients with early cognitive decline. For electrophysiological signals, we are working with the Biorobotic Institute - Scuola Superiore Sant’Anna (Pontedera, Italy) 21 to compute more objective measurements of emotion.
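The difference between the early and late fusion strategies discussed in this subsection can be made concrete with a short sketch. It is an illustration only: the modality names, feature values, and class scores are hypothetical, and the weighted average used for late fusion is one simple choice among many.

```python
def early_fusion(features_by_modality):
    """Early fusion: concatenate per-modality feature vectors into one vector.

    Modalities are visited in sorted name order so the layout is deterministic.
    A single downstream model would then be trained on the joint vector.
    """
    fused = []
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(scores_by_modality, weights=None):
    """Late fusion: weighted average of class scores from per-modality models."""
    names = sorted(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    num_classes = len(scores_by_modality[names[0]])
    return [
        sum(weights[name] * scores_by_modality[name][c] for name in names) / total
        for c in range(num_classes)
    ]


# Hypothetical features (early fusion) and class scores (late fusion).
features = {"video": [1.0, 2.0], "audio": [3.0]}
scores = {"video": [0.2, 0.8], "audio": [0.6, 0.4]}

print(early_fusion(features))
print(late_fusion(scores))
print(late_fusion(scores, weights={"video": 3.0, "audio": 1.0}))
```

Late fusion keeps one model per modality, so heterogeneous inputs never have to share a feature space; the price is that cross-modal interactions are only captured at the score level.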
3.2 Axis 2: Data Generation for Augmentation and Anonymization
Participants: Antitza Dantcheva, François Brémond.
3.2.1 Data Generation
Participants: Antitza Dantcheva, François Brémond.
In the past decade, computer vision has witnessed remarkable progress fueled by the triptych of (a) algorithms for training computer vision models (e.g., backpropagation), (b) increased computational power (think of powerful graphical processing units (GPUs)), but very importantly by (c) increased volumes of training data. For example, millions of facial images (i.e., MegaFace) have rapidly driven progress in face recognition, showcasing that better models are empowered by bigger data. Even in the occasional abundance of raw data, there is a plethora of remaining challenges in designing data-driven intelligence approaches such as deep neural networks (DNNs). These challenges stem from the fact that data must be processed; for example, data must be annotated (e.g., annotation of facial expressions in facial videos), in order to optimize the millions of network-parameters. To make things worse, the curation of large datasets is tedious, costly, time-consuming and is fundamentally bounded by the population sizes of such data, as well as by the ever-increasing privacy and usage considerations that have been recently highlighted by the General Data Protection Regulation (GDPR). The resulting real data and associated real-life datasets are scarce, private, and they inherit human biases. As such, these limitations threaten to bring any advances in computer vision to a dramatic halt. Therefore, we are now at a point, where the availability of annotated data is the main bottleneck in the development of data-hungry DNN models; a bottleneck that far exceeds any algorithmic or computational bottleneck. Based on the premise that computer vision data-driven intelligence is heavily influenced by the underlying data, we here seek to understand how one can actually create data that will augment the learning space and the learning capabilities of computer vision models. 
Generated data or synthetic data provides a promising solution to the above challenges, as it is easier to obtain, it is inexhaustible, pre-annotated, and less expensive. In addition, synthetic data has the potential to avoid ethical and privacy concerns, as well as practical issues related to security. Further, synthetic data brings to the fore unique opportunities, allowing for the surgical injection of training data in scenarios where collecting real data may be impractical or impossible (e.g., talking dogs, faces that do not exist, etc.). Indeed synthetic data allows for new training paradigms in computer vision models. We will design methods that allow synthetic data to be dynamically generated, directly as a function of the needs of learning algorithms.
Past attempts for synthetic images and videos.
Generative models of images in computer vision have received unprecedented attention, owing to recent breakthroughs in the underlying modeling methodology. The most powerful models today are built on generative adversarial networks (GANs), autoregressive transformers, and most recently diffusion models. Diffusion models (DMs) are neural networks trained to denoise images that have been successively corrupted with Gaussian noise, by learning to reverse this diffusion process. After training, such a model can generate data by simply passing randomly sampled noise through the learned denoising process. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples. During the denoising process, conditional features such as class labels can be fed to the network to specialize its sampling process. Such DMs outperform previous generative methods, as they offer robust, stable, and scalable training procedures. DMs are largely unaffected by training limitations such as overfitting or the mode collapse seen in GANs. In addition, DMs generally involve fewer parameters than transformer-based counterparts, which typically require massive amounts of data and thus experience a performance plateau. As diverse synthetic data is a primary need for computer vision, DMs have been rapidly adopted in several settings such as image and video generation, image deblurring, high-resolution image generation, and image editing.
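The noising half of a diffusion model can be sketched in a few lines. The report does not fix a formulation, so this sketch assumes the widely used DDPM-style forward process with a linear variance schedule, in which a noisy sample at step t is obtained in closed form as sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps. The schedule bounds, step count, and function names are illustrative assumptions, and the "image" is a toy list of four values.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from x0 in one shot and return it with the noise that was used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return xt, eps


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" of four pixels
xt, eps = noise_sample(x0, 999, alpha_bars, rng)

print(alpha_bars[0], alpha_bars[-1])  # signal weight at the first and last step
```

In this formulation the denoising network is typically trained to predict `eps` from `xt` and `t`; generation then starts from pure noise and applies the learned denoiser step by step in reverse.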
Challenges in video generation.
However, while the image domain has seen great progress, video has proven more challenging due to (i) the significant computational cost of training on video data and (ii) the lack of large-scale, general, and publicly available video datasets. Regarding (i), training current state-of-the-art image generation models is already extremely expensive computationally, which makes it exceedingly hard to generate videos, particularly videos of variable length. Regarding (ii), while image generation can draw on datasets with billions of images, video datasets are much smaller (e.g., the VoxCeleb2 dataset contains about 1M videos) and thus cannot support the higher complexity of open-domain videos.
Limited settings of generated videos. Very recently, video generation methods such as DM-based Imagen Video and Make-a-Video, showcased the stunning potential of generative AI. However, to date, the generated videos remain heavily constrained in quality, resolution, as well as length, mainly due to having video encoders that only encode fixed size videos or encode frames independently. Such video generation methods are further limited as they currently produce results only depicting single persons, performing simple motions in highly constrained settings with mostly a neutral background. Crucial in our effort will be our goal of generating videos that encompass complex settings of multiple subjects, able to interact in front of a non-uniform background.
Control. While we are beginning to understand some properties of DMs (for example, that DMs are superior to GANs in terms of reconstruction and encoding), understanding the limits of control of such models is still in its infancy. In an effort to control generated images, recent works explored the discovery of semantically meaningful directions in the latent space of pre-trained GANs, where linear navigation corresponds to the desired manipulation of images. In this context, supervised as well as unsupervised approaches were proposed to edit semantics such as facial attributes, colors and basic visual transformations (e.g., rotation and zooming) in generated or inverted real images. The recent introduction of Latent Diffusion Models (LDMs) is a positive development in this direction, as LDMs reduce the heavy computational burden of training on high-resolution images. In addition, our own work revealed, in the context of autoencoder generation models, how to disentangle motion and appearance in videos, as well as how to manipulate decomposed, semantically meaningful motion directions. However, in the context of LDMs, disentanglement and manipulation of semantic attributes remain a key open research challenge of substantial potential impact, and these are challenges that we will explore.
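The notion of linear navigation in a latent space can be illustrated with a short sketch: a latent code z is moved by a scalar step alpha along a unit-normalized direction d, giving z' = z + alpha * d. The direction values and the function names below are hypothetical, and the generator that would decode the edited codes into images is left out.

```python
import math

def normalize(v):
    """Scale a direction vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def navigate(z, direction, alpha):
    """Move latent code z by alpha along the unit-normalized direction."""
    d = normalize(direction)
    return [zi + alpha * di for zi, di in zip(z, d)]

z = [0.2, -1.0, 0.5, 0.0]
direction = [0.0, 3.0, 4.0, 0.0]       # hypothetical semantic direction
# Sweep the step size to obtain a sequence of progressively edited codes.
edited = [navigate(z, direction, alpha) for alpha in (-1.0, 0.0, 1.0)]
```

Sweeping alpha from negative to positive values would, for a meaningful direction, weaken or strengthen the corresponding attribute in the decoded image.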
3.2.2 Data Augmentation and Anonymization
Participants: Antitza Dantcheva, François Brémond.
We aim to apply the data generation models proposed in the previous section in two application domains, namely data augmentation and data anonymization, which cater to the needs of Axis 1 (Human Interaction Recognition).
Data augmentation.
The general focus of data-driven computer vision algorithms is the automated extraction of patterns by finding complex data representations from large volumes of input data without human interference, and the use of these patterns to detect or classify unseen data. The powerful twist that we are envisioning is that data generation places full control over the distribution of the generated data, thus endowing us with the ability to ensure quality and diversity, while saving cost and mitigating bias. As a consequence, we foresee that such synthetic data will allow for nothing less than a paradigm shift in training. For example, as inspired by human systems, synthetic data will bring continual, multimodal, interactive, embodied learning to the next level, providing richer and more sophisticated representations. This applies directly toward the grand goal of allowing computer vision to approach human-level intelligence; a long-term goal that will require grasping key concepts related to the physical world and its composition, as well as an undiminished ability to learn continually, interactively and multimodally 23. We aim to identify entirely new perception models and related learning paradigms, which will exploit synthetic data in an entirely new, efficient and dynamic manner. We consider such models for a variety of recognition settings that can target a broad spectrum of facial behaviors including expressions and micro-expressions. By exploring the fundamental properties of learning with synthetic data, we anticipate computer vision models that generalize onto a large class of human actions.
Data anonymization.
Privacy-preserving data processing has received increased attention in recent years, with challenges relating to anonymizing data while maintaining image quality. The General Data Protection Regulation (GDPR) came into effect on 25 May 2018, affecting all processing of personal data across Europe. GDPR requires regular consent from the individual for any use of their personal data. However, if the data does not allow an individual to be identified, companies are free to use the data without consent. To effectively anonymize images, we require a robust model to replace the original face, without destroying the existing data distribution; that is: the output should be a realistic face fitting the given situation.
Anonymizing images while retaining the original distribution is challenging, as it entails removing all privacy-sensitive information and generating a highly realistic face, while providing a seamless transition between original and anonymized parts. This requires a model that can perform complex semantic reasoning to generate a new anonymized face. For practical use, we want the model to handle a broad diversity of images, poses, backgrounds, and persons. Our proposed solution can successfully anonymize images in a large variety of cases, and create realistic faces consistent with the given conditional information.
4 Application domains
Video understanding consists of a complex pipeline made of various tasks, such as object detection, people tracking, pose estimation, and event detection. So, many tasks are generic, and can be shared between different application domains. The behavior analysis techniques we develop for other applications (for instance for sport or security domains) can be applied to medical applications and vice-versa.
4.1 Medical Applications
Our main motivation, as explained before, is to help clinicians diagnose, monitor and provide pertinent treatment to patients with behavior disorders. The applications we target are not general medical diseases but those related to the brain and more precisely to psychiatric disorders. These disorders can appear very early in the life of the patient (for instance autism spectrum disorder 4), they can concern adults (depression, bipolar disorder, schizophrenia 25) or the elderly (for instance Alzheimer's disease). We have been working with elderly patients since the creation of the CoBTek joint team in January 2012. More recently, we have extended our study to the two other age categories. We now have clinical trials within these three categories of patients.
4.2 Other Applications
Sport applications: Sport is an interesting application domain for human activity understanding for three reasons. First, data are often publicly available, and so raise fewer ethical concerns than medical data. Moreover, many data have been recorded and annotated to be part of international challenges. Second, human activities are complex at the level of individuals, of a team and over time. Third, many companies are interested in funding research to advance the field of human activity understanding for sport. For instance, we have a collaboration on football games with a local company, Fairvision.
Security applications: The interest and investment in vision-based security systems is large and rapidly growing and is fueled by applications ranging from autonomous vehicles to personalization of customer service. Accordingly, numerous companies, military and public organizations are interested in research in this context.
4.3 Ethical and Acceptability Issues
The development and ultimate use of novel assistive technologies by a vulnerable user group such as individuals with dementia, and the assessment methodologies planned by STARS, are not free of ethical, or even legal, concerns, even if many studies have shown how these Information and Communication Technologies (ICT) can be useful and well accepted by older people with or without impairments. Thus, one goal of the STARS team is to design the right technologies that can provide the appropriate information to the medical carers while preserving people's privacy. Moreover, STARS pays particular attention to ethical, acceptability, legal and privacy concerns that may arise, addressing them in a professional way following the corresponding established EU and national laws and regulations, especially when outside France. STARS can also benefit from the support of the COERLE (Comité Opérationnel d'Evaluation des Risques Légaux et Ethiques) to help it respect ethical policies in its applications.
As presented in Section 2, STARS aims at designing cognitive vision systems with perceptual capabilities to efficiently monitor people's activities. As a matter of fact, vision sensors can be seen as intrusive, even if no images are acquired or transmitted (only meta-data describing activities need to be collected). Therefore, new communication paradigms and other sensors (e.g. accelerometers, RFID (Radio Frequency Identification), and new sensors to come in the future) are also envisaged to provide the most appropriate services to the observed people, while preserving their privacy. To better understand ethical issues, STARS members are already involved in several ethical organizations.
For addressing the acceptability issues, focus groups and HMI (Human Machine Interaction) experts are consulted on the most adequate range of mechanisms to interact and display information to older people.
5 Social and environmental responsibility
5.1 Footprint of research activities
We have limited our travel by reducing our physical participation in conferences and in international collaborations.
5.2 Impact of research results
We have been involved for many years in promoting public transportation by improving safety onboard and in stations. Moreover, we have been working on pedestrian detection for self-driving cars, which will also help reduce the number of individual cars.
6 Highlights of the year
6.1 Awards
- Antitza Dantcheva was appointed 3IA chair.
- Monique Thonnat has been nominated Coordinatrice Alpes Maritimes for the FUAE foundation (Fondation Un Avenir Ensemble) of the Grande Chancellerie de la Légion d'Honneur. The objective is to promote social mobility by offering recipients of national honors the opportunity to mentor deserving and motivated students from high school to higher education and entry into working life.
6.2 Major results
- A first work has consisted of releasing novel tracking algorithms that can reliably track people through a video stream. These algorithms can combine bounding box detection with pixel mask to significantly improve the quality of tracking and to be able to track people on a long-term basis.
- During this period, several novel activity recognition algorithms have also been designed for Activities of Daily Living (ADLs) in real-world settings. These algorithms achieved the best performance on all relevant action datasets. Previously, these algorithms were built in more or less supervised settings. Thus, we have proposed new algorithms for action detection in a weakly supervised setting with only video-level labels. These algorithms can reliably detect specific events with their time of occurrence within untrimmed videos.
- We have also improved the quality and the capacity of action recognition algorithms by processing long videos with a duration of more than 10 minutes. For that, we have designed new adapters that can be plugged into strong video backbones and thus necessitate only retraining the adapters, which reduces the training time and enables a training process with videos of a much longer duration.
- We have also designed novel algorithms for video action anticipation that can detect some possible events after having observed only a limited amount of normal video streams.
- All these algorithms have been successfully evaluated on the main international benchmarks and also on video datasets depicting patients with cognitive disorders in order to help doctors to better monitor their patients.
7 Latest software developments, platforms, open data
7.1 Open data
We have provided two benchmark datasets.
Stress ID Dataset: a Multimodal Dataset for Stress Identification
- Contributors: Hava Chaptoukaev, Valeriya Strizhkova, Michele Panariello, Bianca Dalpaos, Aglind Reka, Valeria Manera, Susanne Thummler, Esma Ismailova, Nicholas Evans, François Brémond, Massimiliano Todisco, Maria A Zuluaga, Laura M Ferrari.
- Description: It contains RGB facial video, audio and physiological signals (ECG, EDA, Respiration). Different stress-inducing stimuli are used: emotional video-clips, cognitive tasks and public speaking. The total dataset consists of recordings from 65 participants that performed 11 tasks. Each task is labeled by the subjects in terms of stress, relaxation, arousal, and valence. The experimental set-up ensures synchronized, high-quality, and low noise data.
- Dataset PID (DOI,...):
- Project link:
- Publications: StressID: a Multimodal Dataset for Stress Identification, Thirty-seventh Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2023 7
- Contact: stressid.dataset@inria.fr
- Release contributions: The Dataset is licensed for non-commercial scientific research purposes.
Toyota Smarthome Datasets: Real-World Activities of Daily Living.
- Contributors: Rui Dai, Srijan Das, Saurab Sharma, Luca Minciullo, Lorenzo Garattoni, François Brémond, Gianpiero Francesca.
- Description: Smarthome has been recorded in an apartment equipped with 7 Kinect v1 cameras. It contains the common daily living activities of 18 subjects. The subjects are senior people in the age range 60-80 years old. The dataset has a resolution of 640×480 and offers 3 modalities: RGB + Depth + 3D Skeleton. The 3D skeleton joints were extracted from RGB. For privacy-preserving reasons, the face of the subjects is blurred. Currently, two versions of the dataset are provided: Toyota Smarthome Trimmed and Toyota Smarthome Untrimmed.
- Dataset PID (DOI,...): 10.1109/TPAMI.2022.3169976
- Project link:
- Publications: Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection, PAMI 2022 18.
- Contact: toyotasmarthome@inria.fr
- Release contributions: The Dataset is licensed for non-commercial scientific research purposes.
8 New results
This year, STARS has proposed new results related to its two main research axes: (i) Human Interaction Recognition and (ii) Data Generation for Augmentation and Anonymization.
Human Interaction Recognition
Participants: François Brémond, Antitza Dantcheva, Michal Balazia, Monique Thonnat, Baptiste Chopin, Di Yang, Abid Ali, Olivier Huynh, Tomasz Stanczyk, Sanya Sinha, Mohammed Guermal, Tanay Agrawal, Snehashis Majhi, Aglind Reka.
The new results for Human Interaction Recognition are:
- No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond (see 8.1)
- Does Re-ID Really Help in Multi-Object Tracking? (see 8.2)
- CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets (see 8.3)
- Are Attention Maps Richer than we Imagined for Action Recognition? (see 8.4)
- Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation (see 8.5)
- SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living (see 8.6)
- LLAVIDAL : A Large LAnguage VIsion Model for Daily Activities of Living (see 8.7)
- Human-Centric Video Understanding: From Single-Modality to Multi-Modal Learning (see 8.8)
- B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition (see 8.9)
- Loose Social-Interaction Recognition in Real-world Therapy Scenarios (see 8.10)
- Just Dance with π!, A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection (see 8.11)
- Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection (see 8.12)
- Denoise, Divide, Distill, and Predict (): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy (see 8.13)
- Not All Blends Are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations (see 8.14)
- The INEMO Dataset: A Multimodal Benchmark of Physiological and Behavioral Responses to Social Media and Film Stimuli (see 8.15)
- EEG Classification with Limited Data: A Deep Clustering Approach. (see 8.16)
- MEPHESTO: Multimodal Phenotyping of Psychiatric Disorders from Social Interaction (see 8.17)
- MultiMediate'25: Cross-Cultural Multi-domain Engagement Estimation (see 8.18)
- Stress Estimation in Dancers for Injury Prevention (see 8.19)
- Emotion Recognition using Deep Learning (see 8.20)
- Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network (see 8.21)
- TBDM: Temporal Boundary Distillation Module for Surgical Gesture Segmentation (see 8.22)
- Effective Video Feature Extraction for Training and Comprehension: Human-Centered Multimodal Video (see 8.23)
Data Generation for Augmentation and Anonymization
Participants: François Brémond, Antitza Dantcheva, Baptiste Chopin, Nabyl Quignon, Charbel Yahchouchi, Anil Egin, Michal Balazia, Di Yang, Valeriya Strizhkova.
The new results for Data Generation for Augmentation and Anonymization are:
- Rotation-Induced Centroid Shift in Latent Space (see 8.24)
- Dual Volume Skeleton-Guided 3D Face Reconstruction from Sparse Views (see 8.25)
- Turbo Learning: 3D Face Reconstruction with Mesh Re-Projection and Re-Identification Consistency (see 8.26)
- THEval. Evaluation Framework for Talking Head Video Generation (see 8.27)
- Beyond Real versus Fake Towards Intent-Aware Video Analysis (see 8.28)
- AI killed the video star. Audio-driven diffusion model for expressive talking head generation (see 8.29)
- LIA-X: Interpretable Latent Portrait Animator (see 8.30)
- Simplicity-Bias-Aware Adaptation of Foundation Models for Deepfake Detection (see 8.31)
- Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos (see 8.32)
- Beyond the visible: A survey on cross-spectral face recognition (see 8.33)
8.1 No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond
Participants: Tomasz Stanczyk, Seongro Yoon, François Brémond.
We proposed McByte 46, a novel tracking-by-detection framework that enhances multi-object tracking (MOT) by integrating temporally propagated segmentation masks as an additional association cue. The key objective was to improve robustness and generalization in challenging sports scenarios - characterized by fast motion, occlusions, blur, and camera shifts - without requiring any training or per-sequence parameter tuning.
Starting from a strong ByteTrack-based baseline, we designed a pipeline that combines Kalman filter motion prediction, IoU-based matching, and a pre-trained mask temporal propagation model. The propagated masks are not used blindly; instead, we introduced regulated policies that activate mask-based guidance only in well-defined situations - namely ambiguity (multiple plausible associations) and isolation (failure of IoU-based matching). This controlled fusion ensures that the mask cue strengthens association decisions while avoiding instability caused by unreliable mask predictions.
Figure 3 illustrates the full tracking pipeline, showing how bounding-box predictions, detections, and temporally propagated masks are jointly integrated into a unified association cost matrix solved via Hungarian matching. This design allows McByte to preserve the strengths of tracking-by-detection while benefiting from the spatial coherence provided by mask propagation.
The image illustrates a tracking process in video analysis. It begins with tracklet boxes from the previous frame (t-1) processed through a Kalman filter to predict tracklet boxes for the current frame (t). In parallel, an object detector identifies detection boxes in the current frame. These predictions and detections are matched using Hungarian matching assignment to update tracklets. Masks from frame t-1 are propagated temporally and matched to detections using IoU-based and mask-enhanced matching to ensure accurate tracking of objects across frames, shown in a basketball scene with two players. (Description generated at January 19th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
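The gated use of masks described above can be sketched as follows. This is a minimal illustration and not McByte's implementation: the IoU threshold, the precomputed `mask_scores` matrix, the subtraction-based fusion rule, and the exhaustive search (a small-input stand-in for Hungarian matching) are all assumptions made for the example.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def gated_cost_matrix(tracks, detections, mask_scores, iou_thr=0.3):
    """Cost is 1 - IoU; the mask cue is subtracted only for ambiguous or isolated tracks."""
    costs = []
    for i, trk in enumerate(tracks):
        ious = [iou(trk, det) for det in detections]
        plausible = sum(v >= iou_thr for v in ious)
        use_mask = plausible != 1      # 0 -> isolation, >= 2 -> ambiguity
        row = []
        for j, v in enumerate(ious):
            cost = 1.0 - v
            if use_mask:
                cost -= mask_scores[i][j]   # hypothetical fusion rule
            row.append(cost)
        costs.append(row)
    return costs

def best_assignment(costs):
    """Exhaustive minimum-cost matching for tiny inputs (tracks <= detections)."""
    n_trk, n_det = len(costs), len(costs[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_det), n_trk):
        total = sum(costs[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best)

# Two close players: IoU alone is ambiguous, the mask cue decides.
tracks = [(0, 0, 10, 10), (2, 0, 12, 10)]
detections = [(1, 0, 11, 10), (1.5, 0, 11.5, 10)]
mask_scores = [[0.1, 0.9], [0.9, 0.1]]      # hypothetical mask-to-detection overlaps
assignment = best_assignment(gated_cost_matrix(tracks, detections, mask_scores))
```

In this toy case both tracks have two plausible detections, so the mask cue is activated and overrides the IoU-only choice; a track with exactly one plausible detection would ignore the mask scores entirely.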
We conducted extensive ablation studies to analyze the impact of each design choice, demonstrating that uncontrolled use of masks can degrade performance, whereas carefully gated mask usage yields consistent gains. Qualitative results further show McByte’s ability to maintain identities through heavy occlusions and motion blur. In particular, Fig. 4 highlights challenging football scenarios where McByte successfully preserves tracklets that baseline methods fail to maintain due to abrupt camera motion and degraded visual quality.
We evaluated McByte on four diverse datasets - SportsMOT, DanceTrack, SoccerNet-tracking 2022, and MOT17 - using standard MOT metrics (HOTA, IDF1, MOTA). Across all benchmarks, McByte consistently outperformed strong tracking-by-detection baselines, especially in sports datasets, while remaining competitive on pedestrian tracking. Importantly, these improvements were achieved without training, dataset-specific tuning, or additional annotations, demonstrating the method’s generality and practical value.
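For readers less familiar with these metrics, two of them have simple closed forms: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN). The sketch below computes both from accumulated counts; the counts are invented for illustration and are not results from the report, and HOTA is omitted because it requires a more involved computation.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts accumulated over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# Illustrative counts only (not taken from any benchmark).
example_mota = mota(false_negatives=100, false_positives=50, id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
```

MOTA is dominated by detection errors, whereas IDF1 rewards keeping the same identity over time, which is why both are usually reported together.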
Overall, this work introduces a generic, training-free MOT framework that bridges the gap between detection-based and mask-based tracking, offering a robust solution applicable across sports and non-sports domains.
Figure 4 (panels): Baseline | McByte
The image depicts a soccer match in the Barclays Premier League between Liverpool (LIV) and Manchester United (MU) with a score of 1-2. Liverpool is playing with 10 men due to a red card to Gerrard at the 46th minute. The time shown is 86:15. The scene focuses on the goal area, where the goalkeeper and other players are positioned. Several players are highlighted with numbers such as 160, 131, 112, 106, 132, 177, and 144. Arrows point to the goalkeeper and two other players near the goalpost, indicating their positions and possible actions. (Description generated at January 19th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.2 Does Re-ID Really Help in Multi-Object Tracking?
Participants: Tomasz Stanczyk, François Brémond.
We conducted a systematic and critical analysis of the role of person re-identification (re-ID) in multi-object tracking (MOT) 49. While re-ID is widely assumed to improve association quality, its actual contribution in practical tracking pipelines remains unclear. Our goal was to rigorously evaluate when, how, and to what extent re-ID genuinely benefits MOT performance.
We focused our study on the widely used BoT-SORT tracking framework and evaluated multiple re-ID configurations, including re-ID trained on the target dataset, re-ID trained on external datasets, and a strong generic re-ID model. Experiments were conducted on the MOT17 validation set, using both ground-truth detections and realistic detector outputs to disentangle the effects of detection quality from appearance-based association.
Beyond standard tracking evaluations, we introduced a custom re-ID assessment protocol tailored to tracking. This protocol directly measures correct and incorrect inter-frame matches produced by re-ID, enabling a deeper understanding of re-ID behavior in realistic tracking scenarios. We analyzed cosine distance distributions, match accuracy, and failure modes across sequences with varying crowd density, occlusion patterns, and bounding-box sizes.
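A protocol of this kind can be sketched as follows, under stated assumptions: each detection carries a ground-truth identity and an appearance embedding, every detection in the current frame is matched to its nearest neighbour in the previous frame by cosine distance, and a match is accepted only below a threshold. The threshold value, the data layout and the function names are hypothetical and do not reproduce the protocol used in the study.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def count_matches(prev, curr, max_dist=0.3):
    """prev and curr are lists of (identity, embedding).

    Returns (correct, incorrect, unmatched) inter-frame matches.
    """
    correct = incorrect = unmatched = 0
    for gt_id, emb in curr:
        best_id, best_d = None, float("inf")
        for prev_id, prev_emb in prev:
            d = cosine_distance(emb, prev_emb)
            if d < best_d:
                best_id, best_d = prev_id, d
        if best_id is None or best_d > max_dist:
            unmatched += 1
        elif best_id == gt_id:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, unmatched

# Toy 2-D embeddings: identity 2 looks like identity 1, identity 3 is new.
prev = [(1, [1.0, 0.0]), (2, [0.0, 1.0])]
curr = [(1, [0.9, 0.1]), (2, [0.8, 0.2]), (3, [-1.0, 0.0])]
result = count_matches(prev, curr)
```

Counting incorrect matches separately from unmatched detections makes the effect of the distance threshold visible: a looser threshold turns unmatched detections into matches, some of which are wrong.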
Our results show that re-ID often provides only marginal gains and, in several scenarios, can even degrade tracking performance, especially when bounding boxes are small, heavily occluded, or visually ambiguous. We further demonstrated that tuning re-ID similarity thresholds is non-trivial and highly sequence-dependent, undermining the robustness and general applicability of re-ID-based association.
To mitigate these issues, we explored constraints on re-ID usage, such as filtering based on occlusion level and minimum bounding-box size. While these constraints reduced incorrect matches in isolation, their impact on full tracking performance remained limited and inconsistent across sequences.
Overall, this work provides evidence-based insight into the limitations of re-ID in MOT and challenges the assumption that stronger re-ID models automatically lead to better tracking. We conclude that re-ID is not a universally reliable solution for improving MOT and that its effectiveness is strongly conditioned on scene characteristics, detection quality, and careful integration into the tracking pipeline.
This study offers practical guidance for both researchers and practitioners, encouraging more critical and context-aware use of re-ID in future MOT systems.
8.3 CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets
Participants: Tanay Agrawal, Mohammed Guermal, Michal Balazia, François Brémond.
Challenges in cross-learning involve inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP (i.e., natural language processing), namely adapters and prefix tuning, we present a new model-agnostic plugin architecture for cross-learning, called CM3T 36, that adapts transformer-based models to new or missing information (see Figure 5). We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient, as the backbone and other plugins do not need to be fine-tuned along with these additions.
Backbones pretrained using self-supervised learning provide good general features, thus all methods of fine-tuning work well. In the case of supervised pretraining, adapters fail to perform well (in red) and CM3T is introduced to solve this (in green).
Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to or better than the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.
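To make the idea of training only small plug-in modules concrete, here is a minimal sketch of a generic residual bottleneck adapter of the kind used in NLP, together with the ratio of adapter parameters to frozen backbone parameters. It is not CM3T's multi-head vision adapter or cross-attention adapter; the layer sizes and the backbone parameter count are hypothetical values chosen for the example.

```python
def adapter_param_count(dim, bottleneck):
    """Weights and biases of a down-projection (dim -> bottleneck) and an up-projection back."""
    return (dim * bottleneck + bottleneck) + (bottleneck * dim + dim)

def trainable_fraction(backbone_params, num_layers, dim, bottleneck):
    """Adapter parameters relative to the frozen backbone."""
    return num_layers * adapter_param_count(dim, bottleneck) / backbone_params

def adapter_forward(x, w_down, b_down, w_up, b_up):
    """Residual bottleneck adapter: x + W_up @ relu(W_down @ x + b_down) + b_up."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_down, b_down)]
    delta = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(w_up, b_up)]
    return [xi + di for xi, di in zip(x, delta)]

# Hypothetical sizes: 12 layers of width 768, bottleneck 64, 86M-parameter backbone.
fraction = trainable_fraction(backbone_params=86_000_000, num_layers=12, dim=768, bottleneck=64)

# With a zero-initialized up-projection the adapter starts as the identity,
# so plugging it in does not disturb the frozen backbone before training.
x = [1.0, 2.0]
out = adapter_forward(x, w_down=[[1.0, 1.0]], b_down=[0.0],
                      w_up=[[0.0], [0.0]], b_up=[0.0, 0.0])
```

Only the adapter weights would receive gradients in such a setup, which is what keeps the trainable share small compared to the backbone.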
8.4 Are Attention Maps Richer than we Imagined for Action Recognition?
Participants: Tanay Agrawal, Abid Ali, François Brémond.
Deep learning models are becoming more general and robust by the day. Specifically, image foundation models have recently progressed rapidly. We introduce a way to exploit this progress in the field of video classification. The basic idea is that if we have a good understanding of space, we should not require complicated spatio-temporal processing. We introduce the Attention Map (AM) flow, a way to identify the location of local changes between two frames in a video, without adding parameters specifically for it. We utilize adapters, which have been growing in popularity in the field of parameter-efficient transfer learning. These help us incorporate AM flow in a pretrained image model without the need to fine-tune it. With just these changes and minimal temporal processing, an image model is able to achieve state-of-the-art results on popular action recognition datasets with low training time and minimal pretraining. This work explores the theory behind this idea and the intricacies involved. Through relevant experiments, we show the efficacy of this method and discuss various ideas to take this work forward. We use the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets and achieve state-of-the-art or comparable results. We also show that video models depend on extensive pretraining on multiple datasets and long training times, both of which our approach alleviates.
This work has been published at WACV 2025 35.
8.5 Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation
Participants: Tanay Agrawal, Abid Ali, François Brémond.
Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this issue, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy, which first optimizes for high temporal and low spatial resolution and then vice versa, allows the model to utilize both high spatial and temporal resolutions during inference while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.
This work has been published at ICCV 2025 38.
8.6 SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
Participants: Arkaprava Sinha, Dominick Reilly, François Brémond, Srijan Das.
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks. Our code is available on GitHub.
This work has been published at AAAI 2025 45.
8.7 LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Participants: Dominick Reilly, Francois Bremond, Srijan Das.
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS (i.e., RGB and Segmentation) instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art (SOTA) performance across ADL benchmarks.
This work has been published at CVPR 2025 43.
8.8 Human-Centric Video Understanding: From Single-Modality to Multi-Modal Learning
Participants: Mahmoud Ali, Di Yang, Francois Bremond.
General pipeline of MoVie for action detection
Human action recognition is an active research field with significant contributions to applications such as home-care monitoring, human-computer interaction, and game control. However, recognizing human activities in real-world videos remains challenging, especially when learning effective video representations with a high expressive power to represent human spatio-temporal motion, view-invariant actions, complex composable actions, etc. To address this challenge, we made three contributions toward learning effective representations that can be applied and evaluated in real-world human action classification, retrieval, prediction, detection, and segmentation tasks by transfer learning.
The first contribution (single modality): we improve the generalizability of human skeleton motion representation models under the skeleton-only modality. We introduce two novel self-supervised learning frameworks based on contrastive learning to learn robust and transferable skeleton representations without relying on action labels. By exploiting the inherent spatio-temporal structure of human skeleton sequences, our approach encourages discriminative motion representations through instance-level and temporal consistency objectives. Extensive evaluations demonstrate that the proposed frameworks improve performance across diverse downstream tasks and scenarios, bridging the gap between controlled 3D laboratory datasets (e.g., NTU-RGB-D) and challenging 2D real-world datasets (e.g., SmartHome), highlighting the strength of SSL (i.e., Self-Supervised Learning) for skeleton-based motion understanding.
The second contribution (two modalities): despite the effectiveness of skeleton-based models in capturing spatial and temporal dynamics, they struggle to recognize fine-grained actions. In particular, they fail to distinguish between semantically similar actions, such as "drinking from a cup" versus "drinking from a bottle", as these models lack access to object-centric and semantic information. To address this, we propose MoVie (see Fig. 6), a motion-augmented framework designed to improve real-world human action detection by integrating skeleton motion features with visual information through the Motion-Vision Mixer and by incorporating history-aware temporal modeling.
Overview of the framework
The third contribution (multiple modalities): our previous works show that VLFMs (i.e., Vision-Language Foundation Models) are still far from satisfactory performance in all evaluated tasks, particularly on densely labeled and long video datasets, such as those covering fine-grained activities in complex, real-world scenarios. As shown in Fig. 7, we introduce our proposed Transferable skeleton MOtion Representation learning architecture (T-MOR), based on a contrastive motion-video-language pre-training strategy. The pre-trained skeleton model is effective for action classification, segmentation, and zero-shot action recognition tasks.
Overall, this work contributes to the field of human-centric video understanding by proposing novel methods for skeleton-based action representation learning and general RGB video representation learning. Such representations benefit both action classification and segmentation tasks.
8.9 B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition
Participants: Aglind Reka, Nishit Poddar, Diana Borza, Snehashis Majhi, Michal Balazia, Francois Bremond.
Micro-action recognition (MAR) presents unique challenges due to the inherently subtle, fleeting, and ambiguous nature of micro-actions. Unlike conventional actions, which are often clearly distinguishable, micro-actions, such as a slight nod, a subtle shift in posture, or a brief glance, are characterized by their fine-grained motion and short duration. These movements often overlap in meaning and arise from reflexes or situational cues, making them difficult to interpret and classify. Additionally, micro-actions are influenced by environmental and social factors, further complicating their recognition.
A significant issue in current approaches is the failure to account for the structured nature of human motion. Micro-actions often originate from specific body parts, such as the head, torso, or limbs, and follow a consistent body-to-action hierarchy. However, most existing models treat these actions as flat categories, overlooking the spatial dependencies between body regions. This oversight leads to difficulties in isolating informative signals from background noise and differentiating between highly similar micro-movements within the same body region. Another challenge lies in the imbalance and variability of micro-action datasets. Datasets like MA-52, SocialGesture, and MPII-GroupInteraction capture a wide range of human movements, from short, dynamic gestures to long, static postures. This variability in temporal scale and class frequency makes it challenging for models to capture rare yet distinctive motion patterns, which are characteristic of micro-actions.
To address these challenges, we introduce B-MoE, a body-part-aware Mixture-of-Experts framework (see Figure 8) designed to explicitly model the structured nature of human motion. B-MoE specializes in analyzing motions from localized body regions such as the head, torso, upper limbs, and lower limbs, allowing the model to focus on subtle movements and discriminative cues within each region. By doing so, B-MoE suppresses background interference and enhances the detection of fine-grained motion cues, improving the ability to differentiate between ambiguous action classes. Central to B-MoE is the Macro–Micro Motion Encoder (M3E), shown in Figure 9, a lightweight yet powerful backbone that captures both long-range contextual structure and fine-grained local motion. This dual capability enables the model to effectively recognize both prolonged poses and rapid micro-movements. A cross-attention routing mechanism further enhances the framework by dynamically selecting and fusing informative region-wise semantic cues, as shown in Figure 10, which are then integrated with global motion features. Through this approach, B-MoE addresses the core challenges of MAR (subtlety, ambiguity, and class imbalance) by amplifying fine local cues, suppressing irrelevant regions, and providing complementary semantic and motion evidence. This work was submitted to CVPR 2026.
The image depicts a flowchart of a machine learning model. It takes video input and processes it through multiple branches: semantic branches for different body parts (head, body, upper limb, lower limb), a semantic encoder, and a motion encoder. The semantic branches are frozen, while the experts, which are learnable components, adapt. The outputs from these branches and encoders converge and pass through a series of modules including cross-attention mechanisms, a transformer, and a multi-layer perceptron (MLP). The process aims to learn and integrate semantic and motion features for analysis. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
The image depicts a neural network architecture consisting of an SGP layer and a semantic embedding alignment mechanism. The SGP layer includes components like MHSA (Multi-Head Self Attention), ConvC (Convolution), and fully connected (FC) layers. The process starts with input (T, D) through MHSA, followed by several convolutional and pooling operations. The semantic embedding alignment, used only during pre-training, aligns word embeddings through a TAN module with fully connected layers, minimizing the embedding loss. (Description generated at January 16th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
The image depicts a flowchart of a video processing system. It begins with a video input of a person sitting. The video frame (T, C, H, W) is first processed by a module called "Sapiens," likely for segmentation, producing segmentation maps (T, H, W). These maps are combined with another input, represented by a bone structure image, which then goes to a "Semantic Encoder." The output is a feature representation (T/16, 1408) (8 x 1408). Snowflake icons indicate the use of frozen parameters or pre-trained weights in these modules. (Description generated at January 16th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
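The cross-attention routing idea described above can be sketched in a few lines. This is a minimal, single-head illustration under stated assumptions, not the B-MoE implementation: learned query, key, and value projections are omitted, the function name `route_regions` and the toy feature vectors are hypothetical, and fusing by concatenation with the global motion feature is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_regions(global_query, region_feats):
    """The global motion feature queries the region-wise features.

    Returns the per-region routing weights and a fused vector made of the
    attention-weighted sum of region features followed by the global feature.
    """
    names = list(region_feats)
    dim = len(global_query)
    scale = math.sqrt(dim)  # scaled dot-product attention
    weights = softmax([dot(global_query, region_feats[n]) / scale for n in names])
    attended = [
        sum(w * region_feats[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), attended + list(global_query)


# Toy region features (hypothetical values): one 4-d vector per body region.
regions = {
    "head": [1.0, 0.0, 0.0, 0.0],
    "torso": [0.0, 1.0, 0.0, 0.0],
    "upper_limbs": [0.0, 0.0, 1.0, 0.0],
    "lower_limbs": [0.0, 0.0, 0.0, 1.0],
}
query = [2.0, 0.1, 0.1, 0.1]  # global motion feature that mostly matches the head
weights, fused = route_regions(query, regions)
print(max(weights, key=weights.get))
```

In this toy case the routing weights concentrate on the head region, which is the behavior one would expect for a micro-action such as a slight nod.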
Our extensive experiments on three socially contextual micro-action benchmarks (MA-52, MPII-GI, and SocialGesture) demonstrate significant improvements, with gains in macro F1 score of +4.32%, +3.35%, and +1.17%, respectively. These results highlight B-MoE’s robustness to class imbalance and its improved recognition of ambiguous, underrepresented, and low-amplitude actions.
8.10 Loose Social-Interaction Recognition in Real-world Therapy Scenarios
Participants: Abid Ali, Monique Thonnat, Francois Bremond.
The computer vision community has explored dyadic interactions for atomic actions such as pushing, carrying-object, etc. However, with the advancement of deep learning models, there is a need to explore more complex dyadic situations such as loose interactions. These are interactions in which two people perform certain atomic activities to complete a global action irrespective of temporal synchronization and physical engagement, for example cooking-together. Analyzing these types of dyadic interactions has several useful applications in the medical domain for social-skills development and mental health diagnosis. To achieve this, we propose a novel dual-path architecture to capture the loose interaction between two individuals. Our model learns global abstract features from each stream via a CNN backbone and fuses them using a new Global-Layer-Attention module based on a cross-attention strategy. We evaluate our model on real-world autism diagnosis data, namely our Loose-Interaction dataset and the publicly available Autism dataset for loose interactions. Our network achieves baseline results on the Loose-Interaction dataset and SOTA results on the Autism dataset. Moreover, we study different social interactions by experimenting on a publicly available dataset, NTU-RGB+D (interactive classes from both NTU-60 and NTU-120). We have found that different interactions require different network designs. We also evaluate a slightly different version of our method that incorporates time information to address tight interactions, achieving SOTA results.
This work has been published at WACV 2025 37.
8.11 Just Dance with π!, A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva Ali, Francois Bremond.
Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatiotemporal features, which continues to limit their reliability in real-world scenarios. This is because RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: “PI-VAD” (or π-VAD), a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, namely the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones – only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.
This work has been published at CVPR 2025 40.
8.12 Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Francois Bremond.
We identify one of the main issues in the formulation of the weakly-supervised video anomaly detection (WSVAD) task. Multi-instance learning (MIL) strikes a balance between fully supervised methods, which perform well but require costly data annotation, and unsupervised methods, which do not require manual annotations but generally perform worse. The core idea of MIL is to create bags containing negative and positive data samples (i.e., normal and abnormal videos), labeled only at the video level. During training, the model assigns a score between 0 and 1 to each snippet, with 0 indicating a normal snippet and 1 indicating an abnormal snippet. The highest-scoring samples in the normal bag are guided towards 0, allowing the model to learn most normal scenarios correctly. On the other hand, the highest-scoring samples in the abnormal bag are pushed towards 1. As a result, the model is supervised on, and therefore learns, only a few specific instances of anomalous events, ignoring useful information contained in neighboring snippets. Over time, this approach has proved to be powerful but insufficient to train a model to correctly capture the secondary and specific attributes of different anomalous classes. In recent works, different auxiliary objectives have been identified as priors for the VAD task to optimize the training process.
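To make the limitation concrete, the following sketch implements one common top-1 variant of the MIL objective, a binary cross-entropy applied only to the highest-scoring snippet of each bag. It is an assumption chosen for illustration (other works use a ranking loss or top-k selection), and the function name and scores are hypothetical.

```python
import math


def mil_top1_loss(abnormal_scores, normal_scores, eps=1e-7):
    """Top-1 MIL loss: only the highest-scoring snippet of each bag is supervised.

    The top snippet of the abnormal bag is pushed towards 1 and the top snippet
    of the normal bag is pushed towards 0.
    """
    top_abnormal = min(max(max(abnormal_scores), eps), 1.0 - eps)
    top_normal = min(max(max(normal_scores), eps), 1.0 - eps)
    return -math.log(top_abnormal) - math.log(1.0 - top_normal)


normal_bag = [0.1, 0.3, 0.2]

# Two abnormal videos with the same peak score but very different neighbors.
isolated_peak = [0.2, 0.9, 0.3]
sustained_event = [0.8, 0.9, 0.85]

loss_a = mil_top1_loss(isolated_peak, normal_bag)
loss_b = mil_top1_loss(sustained_event, normal_bag)
print(loss_a == loss_b)  # the neighboring snippets do not influence the loss
```

Both abnormal videos produce exactly the same loss, because snippets around the peak never enter the objective. This is the information that the method described next brings back into training.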
Overview of the GS-MoE architecture
To address this issue, we propose to model the anomalies in a video as Gaussian distributions (see Fig. 11), rendering multiple Gaussian kernels in correspondence with peaks detected along the temporal dimension of the scores estimated for abnormal videos. This technique, called Temporal Gaussian Splatting (TGS), creates a more complete representation of an anomalous event over time, including snippets of the anomaly with lower abnormal scores in the training objective. The Gaussian kernels are extracted from the abnormal scores produced by the model.
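A minimal sketch of the idea behind TGS is given below, assuming a simple local-maximum peak detector, unit-height kernels with a fixed width, and a pointwise maximum to merge overlapping kernels. None of these choices are confirmed by the text above, and the function names, `sigma`, and `min_height` are hypothetical.

```python
import math


def find_peaks(scores, min_height=0.5):
    """Indices of local maxima at or above min_height (a simple, hypothetical rule)."""
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < len(scores) - 1 else float("-inf")
        if s >= min_height and s > left and s >= right:
            peaks.append(i)
    return peaks


def temporal_gaussian_targets(scores, sigma=1.5, min_height=0.5):
    """Render one Gaussian kernel per detected peak along the temporal axis.

    Overlapping kernels are merged with a pointwise maximum, giving soft
    targets that also cover snippets near each peak.
    """
    peaks = find_peaks(scores, min_height)
    targets = []
    for t in range(len(scores)):
        values = [math.exp(-((t - p) ** 2) / (2 * sigma ** 2)) for p in peaks]
        targets.append(max(values, default=0.0))
    return peaks, targets


# Snippet-level abnormal scores of one abnormal video (hypothetical values).
scores = [0.1, 0.2, 0.6, 0.9, 0.5, 0.2, 0.1, 0.3, 0.7, 0.4]
peaks, targets = temporal_gaussian_targets(scores)
print(peaks)
print([round(v, 2) for v in targets])
```

Compared with supervising only the top-scoring snippet, the soft targets assign non-zero weight to snippets on either side of each peak, so lower-scoring parts of the same anomalous event contribute to the training objective.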
An additional challenge is related to the intrinsic differences between abnormal classes. Under the MIL paradigm, the models are trained to learn the difference between normal and abnormal videos, while the specific differences between anomalous classes are overlooked. As a result, these methods mainly focus on coarse-level representations of anomalies that allow us to distinguish between normal and abnormal events, but ignore the fine-grained category-specific cues. Therefore, the more salient anomalies (e.g., an explosion) are likely to be easily detected, while subtle anomalies (e.g., shoplifting) are more likely to be confused with normal events. This constitutes a major limitation of most recent WSVAD methods. We address this issue via a Mixture-of-Experts (MoE) architecture, in which each expert is trained to model a single anomaly class, enhancing the specific attributes of each anomaly class that are often overlooked. To further leverage the correlations and differences between anomalies, a gate model mediates between the predictions of each expert and the more coarse-level anomalous features to learn potential interactions between anomalies.
This work has been published at ICCV 2025 48.
8.13 Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy
Participants: Quentin Merilleau, Snehashis Majhi, Antitza Dantcheva, Francois Bremond.
Forecasting abnormal human behavior (AHB) in unconstrained real-world environments is critical for enabling proactive safety interventions 42. Unlike short-term anomaly detection, long-horizon forecasting offers a vital reaction window but remains underexplored due to three core challenges: (i) noisy, complex human–agent interactions; (ii) weak temporal coupling between normal observations and distant anomalies; and (iii) data scarcity limiting the scalability of autoregressive models. To address these, we propose Denoise, Divide, Distill, and Predict (displayed in Fig. 12), a novel encoder–decoder framework that bridges denoised pasts with distilled autoregressive futures. Our Differential Past Encoder (DiPE) disentangles scene-level and object-level dynamics via differential attention, suppressing irrelevant interactions and enhancing discriminative cues. The Distilled Future Auto-Regressive Decoder (D-FAD) adopts a divide-and-conquer strategy, segmenting future queries into temporal chunks for sequential prediction, while leveraging distillation to balance robustness and latency. We validate our approach on the AHB-F benchmark, the only dataset dedicated to abnormal behavior forecasting, and further integrate D-FAD with several state-of-the-art methods. In all cases, our framework consistently outperforms prior work in both forecasting accuracy and computational efficiency. This work has been accepted for publication at WACV 2026.
The image compares traditional video anomaly detection with a new video anomaly anticipation method. In the traditional method, labeled "Video Anomaly Detection," anomalies are identified in the current frame by analyzing past frames and classifying future frames as normal or anomalous. The new method, "Our Video Anomaly Anticipation," not only detects anomalies in the current frame but also predicts future anomalies by utilizing both short-term (1-3 seconds) and long-term (4-8 seconds) anticipations. The diagram is divided into sections showing offline and online processes. Offline training involves multiple frames, while online detection and anticipation use a reference model (Dref) to predict anomalies. The overall aim is to enhance early detection of irregular events in videos. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
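The divide-and-conquer decoding described for D-FAD can be illustrated with a short sketch in which future time steps are split into temporal chunks and predicted one chunk at a time, each chunk conditioning on the predictions made so far. This is only a schematic under stated assumptions: the function names, the chunk size, and the toy predictor are hypothetical, and the distillation component is not modeled.

```python
def chunk_queries(num_future_steps, chunk_size):
    """Split future time steps into consecutive temporal chunks."""
    return [
        list(range(start, min(start + chunk_size, num_future_steps)))
        for start in range(0, num_future_steps, chunk_size)
    ]


def decode_autoregressively(past_summary, num_future_steps, chunk_size, predict_chunk):
    """Predict one chunk at a time, feeding earlier predictions back as context."""
    context = list(past_summary)
    predictions = []
    for chunk in chunk_queries(num_future_steps, chunk_size):
        chunk_scores = predict_chunk(context, chunk)
        predictions.extend(chunk_scores)
        context.extend(chunk_scores)  # later chunks see earlier forecasts
    return predictions


def toy_predictor(context, chunk):
    """Hypothetical stand-in for the decoder: repeat the mean of the context."""
    mean = sum(context) / len(context)
    return [mean for _ in chunk]


chunks = chunk_queries(8, 3)
forecast = decode_autoregressively([0.2, 0.4], 8, 3, toy_predictor)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(len(forecast))
```

Chunking means the decoder is invoked once per chunk rather than once per time step, which is one way to trade off latency against the benefits of autoregressive conditioning.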
8.14 Not All Blends Are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations
Participants: Michal Balazia, Teimuraz Saghinadze, Francois Bremond.
(Both paper and competition are accepted at FG 2026)
Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) BLended EMOtion REcognition (see Figure 13) that includes information on the relative salience of each emotion within a blend.
Examples of stills from the video recordings
The image shows a sequence of four photos featuring a person with dark hair tied back, wearing a black shirt. The photos progressively depict increasing expressions of emotion, starting from a neutral expression and moving through stages of surprise or excitement. The person's mouth opens wider and their facial muscles tense more in each subsequent photo. The background is a plain, neutral gray. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions (anger, disgust, fear, happiness, sadness, and neutral) and 10 distinct blends consisting of all pairwise combinations of anger, disgust, fear, happiness, and sadness. All pairwise combinations (see Figure 14) were further conveyed with three different blend conditions:
- 50/50 = same amount of both emotions (e.g. 50/50 happiness-sadness where both happiness and sadness are expressed in equal proportions)
- 70/30 = the first emotion is more salient than the second emotion (e.g. 70/30 happiness-sadness conveys mainly happiness blended with a tinge of sadness)
- 30/70 = the second emotion is more salient than the first emotion (e.g. 30/70 happiness-sadness conveys mainly sadness blended with a tinge of happiness)
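The label space described above can be enumerated directly. The short sketch below lists the pairwise blends and the three salience conditions; the variable names and the dictionary layout of a label are illustrative assumptions and do not reflect the actual annotation format of the dataset.

```python
from itertools import combinations

# The five emotions that are blended (neutral is only performed on its own).
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]
# Blend conditions as (salience of first emotion, salience of second emotion).
CONDITIONS = [(50, 50), (70, 30), (30, 70)]

# All unordered pairs of distinct emotions.
blends = list(combinations(EMOTIONS, 2))

# One label per (pair, condition), mapping each emotion to its relative salience.
labels = [
    {first: a, second: b}
    for first, second in blends
    for a, b in CONDITIONS
]

print(len(blends), len(labels))  # 10 30
```

Five emotions give 10 pairwise blends, and the three conditions give 30 distinct blended labels, in addition to the 6 single-emotion classes.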
Structure of the BLEMORE full dataset
Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind+WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2+HuBERT) and 18% salience accuracy (HiCMAE).
The BLEMORE dataset is also the basis of the BLEMORE competition, in which participants develop systems to predict the emotions present in each recording and the relative salience of each emotion. To support participation, we provide training data with labels, test data without labels, pre-extracted audio-visual feature embeddings, and baseline unimodal and multimodal classification results. The competition offers the first comprehensive platform for evaluating blended emotion recognition and aims to stimulate methodological innovation in multimodal affective computing.
8.15 The INEMO Dataset: A Multimodal Benchmark of Physiological and Behavioral Responses to Social Media and Film Stimuli
Participants: Wenxin Xiong, Valeriya Strizhkova, Aowen Shi, Michal Balazia, Laura Ferrari, Francois Bremond.
The INEMO dataset is a multimodal benchmark designed to study emotional and behavioral responses to influencer-style social media videos and emotion calibration film clips. As shown in Figures 15 and 16, participants complete two tasks (Influencer and Calibration), in which they watch short video clips and then rate their emotions using 1–9 Self-Assessment Manikin (SAM) scales for valence and arousal, as well as provide preference judgments about the videos. During these sessions, multiple synchronized modalities are recorded, including facial video, electrocardiography (ECG), electrodermal activity (EDA), eye tracking and screen activity, all time-aligned and stored in a structured metadata format organized by participant, task and modality. This design makes INEMO directly usable for machine learning and deep learning models and positions it as a bridge between traditional lab-based affective datasets and more realistic social media scenarios.
The image depicts an experiment protocol with two tasks. Task 1 involves watching influencer video clips in three sets, each with three videos, and using the Self-Assessment Manikin (SAM) to gauge reactions. Participants also rank preferences for individuals and others. Task 2 includes emotion calibration with videos evoking different emotions (amusement, tenderness, sadness, disgust, fear), followed by SAM assessments. The process ends with questionnaires. SAM measures valence (negative to positive) and arousal (calm to exciting). (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
The image shows a person sitting at a table with medical devices attached to their body. Electrodes are placed on their chest and stomach, connected by wires to a device strapped to their left wrist. Another device is strapped to their right wrist, with wires connected to electrodes on their right hand. The person appears to be in a medical or clinical setting, possibly undergoing a diagnostic or therapeutic procedure involving muscle or nerve activity monitoring. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
To evaluate the dataset and illustrate its potential for multimodal emotion recognition, classical machine learning models (SVM, Random Forest, Gradient Boosting) were trained on handcrafted features extracted from ECG and EDA, with and without video features, and compared to a multimodal MVP-based (i.e., Multimodal for Video and Physio) baseline that jointly integrates ECG, EDA and facial video. The best results are obtained with a Gradient Boosting model using the combined ECG+EDA+Video configuration, reaching weighted F1-scores of about 0.78 for valence and 0.76 for arousal, and accuracies up to 0.80 for valence and 0.70 for arousal. These results confirm that the INEMO signals are informative and that the associated classification tasks are learnable, while still leaving room for more advanced multimodal modeling approaches.
8.16 EEG Classification with Limited Data: A Deep Clustering Approach
Participants: Mohsen Tabejamaat, Farhood Negin, Francois Bremond.
This work has been published in Pattern Recognition 2025 34.
8.17 MEPHESTO: Multimodal Phenotyping of Psychiatric Disorders from Social Interaction
Participants: Michal Balazia, Aowen Shi, Miriana Russo, Francois Bremond.
Identifying objective and reliable markers to tailor diagnosis and treatment of psychiatric patients remains a challenge, as conditions like major depression, bipolar disorder, or schizophrenia are qualified by complex behavior observations or subjective self-reports instead of easily measurable somatic features. Recent progress in computer vision, speech processing and machine learning has enabled detailed and objective characterization of human behavior in social interactions. However, the application of these technologies to personalized psychiatry is limited due to the lack of sufficiently large corpora that combine multimodal measurements with longitudinal assessments of patients covering more than a single disorder. Our multi-centre, multi-disorder longitudinal corpus creation effort MEPHESTO is designed to develop and validate novel multimodal markers for psychiatric conditions. MEPHESTO consists of multimodal audio, video, and physiological recordings as well as clinical assessments of psychiatric patients covering a six-week main study period as well as several follow-up recordings spread across twelve months.
Diagnoses include schizophrenia, depression, and bipolar disorder. The dataset does not include control subjects. Each patient contributes 1–8 videos, roughly 5.5 on average. In addition to video, the recordings include the patients' and clinicians' biosignals: electrodermal activity (EDA), blood volume pulse (BVP), inter-beat interval (IBI), heart rate, temperature, and accelerometer data. Videos are recorded with Azure Kinect and biosignals with Empatica. People do not wear face masks while being recorded, although a large transparent plexiglass panel is used to minimize the transmission of COVID-19. The dataset is confidential, but many patients agreed to the publication of their raw or anonymized data for research purposes. Figure 17 shows a screenshot from a mock recording.
The image shows two people sitting in different rooms. Each person has a set of physiological data displayed below them. The data includes temperature, EDA (electrodermal activity), BVP (blood volume pulse), ACC (accelerometer), and HR (heart rate). The person on the left is seated near a window and wears a black and white striped shirt. The person on the right sits near a bookshelf and wears a black top. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
This year, we have made three major contributions regarding therapeutic alliance, recognizing depression and schizophrenia, and detecting childhood trauma from speech. These contributions are explained in detail in the subsections below.
8.17.1 Contextualized Synchrony for Therapeutic Alliance
Non-verbal behavioral synchrony has been widely studied as an indicator of relational dynamics in clinical interactions and has been shown to exhibit weak to moderate associations with therapeutic alliance (TA). However, most existing synchrony measures are computed in a content-agnostic manner, implicitly assuming that synchrony occurring at different moments of an interaction contributes equally to the development of the therapeutic relationship. This work is motivated by the hypothesis that the relational meaning of synchrony is context-dependent, and that linguistic content may play a critical role in determining when non-verbal coordination is most relevant to therapeutic alliance. In our setting, TA is assessed at the end of each session via a seven-item patient questionnaire capturing liking, perceived helpfulness, feeling understood and supported, and ease of sharing personal information, with the global TA score obtained by averaging item responses. By integrating semantic information derived from spoken language with non-verbal synchrony measures, this study aims to move beyond global, uniform synchrony metrics toward a more fine-grained, context-sensitive understanding of therapist–patient interaction dynamics. Non-verbal synchrony was computed at the window level using Motion Energy Analysis (MEA, see Figure 18 for an example of patient–therapist MEA time series) and a cross-correlation framework applied to the continuous motion energy time series of patient and therapist.
The image is a line graph titled "Motion Energy Analysis (MEA): Patient vs. Therapist." It depicts standardized motion energy on the y-axis versus time on the x-axis. The graph compares the motion energy of a patient and a therapist, represented by blue and orange lines, respectively. The patient's motion energy shows smaller, more frequent fluctuations, while the therapist's energy exhibits larger, more sporadic peaks. Both lines show significant activity at the beginning and end of the time period. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
We evaluate all models by predicting session-level TA scores and using Pearson’s correlation coefficient between predicted and observed TA as the primary outcome measure, computed in a session-level cross-validation setting. We first replicated a stable baseline association between global MEA synchrony and patient-reported TA, with a content-agnostic aggregation over all windows yielding a correlation of approximately . Building on this foundation, transcript data were processed into semantic embeddings and temporally aligned with synchrony windows, enabling a multimodal representation in which textual context modulates how window-level synchrony is aggregated over time. In the current implementation, not all MEA windows have a corresponding text segment, so windows without aligned transcripts are ignored when applying text-informed weighting. Evaluating a uniform (all-ones) aggregation under this constraint leads to a reduced MEA-TA association of , compared to the obtained when all MEA windows are used. Within this constrained evaluation setting, however, our text-informed weighting scheme increases the correlation to , suggesting that linguistic information helps to highlight synchrony segments that are more informative about alliance. While the overall performance of this preliminary implementation does not yet surpass the full-window MEA baseline, the results support the view that synchrony is not uniformly informative throughout an interaction and highlight the potential of window-level, context-aware multimodal modeling combined with improved textual coverage for capturing subtle relational processes in therapeutic settings.
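The window-level synchrony and its weighted aggregation can be illustrated with a short sketch. This is not the implementation used in the study: the function names, the peak-absolute-correlation summary per window, and the convention that a window without aligned transcript carries a weight of `None` are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def window_synchrony(patient, therapist, max_lag):
    """Peak absolute cross-correlation over lags in [-max_lag, max_lag] (assumed summary)."""
    n = len(patient)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = patient[lag:], therapist[:n - lag]
        else:
            a, b = patient[:n + lag], therapist[-lag:]
        if len(a) >= 2:
            best = max(best, abs(pearson(a, b)))
    return best

def session_synchrony(patient, therapist, win, step, max_lag, weights=None):
    """Weighted mean of window-level synchrony.

    weights[i] is a hypothetical text-derived weight for window i;
    None marks a window without aligned transcript, which is ignored.
    With weights=None, all windows count equally (content-agnostic baseline).
    """
    scores = [
        window_synchrony(patient[s:s + win], therapist[s:s + win], max_lag)
        for s in range(0, len(patient) - win + 1, step)
    ]
    if weights is None:
        weights = [1.0] * len(scores)
    kept = [(w, s) for w, s in zip(weights, scores) if w is not None]
    total = sum(w for w, _ in kept)
    if total == 0:
        raise ValueError("no window with a positive weight")
    return sum(w * s for w, s in kept) / total
```

Passing all-ones weights restricted to the transcript-covered windows reproduces the constrained uniform baseline discussed above, while non-uniform weights correspond to the text-informed scheme.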
8.17.2 Psychiatric Diagnosis Classification through Temporal Behavioral Analysis
This sub-project focuses on automated psychiatric diagnosis through multimodal behavioral analysis of clinical interview videos, with the objective of distinguishing between depression and schizophrenia. We utilize a portion of the MEPHESTO dataset comprising 34 patients: 25 with depression and 9 with schizophrenia. The dataset includes manual behavioral annotations provided by expert clinical annotators who labeled over 3000 video segments with observable behaviors. The implemented system (see Figure 19) follows a 7-stage pipeline: (1) input data acquisition from MEPHESTO with pre-annotated transcriptions, (2) low-level extraction using OpenFace 3.0 (8 Action Units: AU01, AU02, AU04, AU06, AU07, AU12, AU14, AU45 + gaze + head pose + 8 emotions), MediaPipe holistic (33 pose, 42 hand, 468 face landmarks), and Whisper for speech (1,842 features/frame), (3) temporal alignment with frame-level synchronization (±1 frame precision, 33ms), (4) multi-scale windowing (5s, 10s, 30s windows, 50% overlap) extracting 188 features across 24,588 windows, (5) temporal variability aggregation computing 6 statistics per feature (mean, standard deviation, coefficient of variation, minimum, maximum, range), (6) feature selection via ANOVA F-test selecting the top 20 features (70% speech-based, 30% visual), and (7) classification with a random forest, chosen among 13 tested methods and evaluated using leave-one-out cross-validation.
The image depicts a process for diagnosing psychiatric conditions (Depression vs. Schizophrenia) using a baseline random forest model with feature fusion. It involves extracting multi-modal features from patient interview videos using three pipelines: OpenFace 3.0 for facial actions and gaze, MediaPipe for body and hand movements, and Whisper with speech analysis for speech features. These features are temporally windowed, fused, and statistically analyzed. Feature selection is performed using ANOVA F-test, reducing the dataset to 20 features. A random forest classifier is trained and validated using leave-one-out cross-validation, achieving 94.1% accuracy. A confusion matrix and top feature importance are displayed, highlighting the most influential features for diagnosis. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
The random forest achieves 94.1% accuracy, with only two schizophrenia patients misclassified. The top discriminative feature is the standard deviation of the patient's incomplete utterances. Our experiments indicate that temporal variability is the critical discriminative marker, that speech features dominate the top-20 features (70%), that feature fusion outperforms modality separation, and that traditional machine learning beats deep learning on small datasets. In future work, we will focus on temporal trauma detection in long untrimmed clinical interviews.
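Stages (4) and (5) of the pipeline, multi-scale windowing with 50% overlap followed by six variability statistics, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the helper names are hypothetical, and the choice to summarize each window by its mean before computing the six statistics across windows is one plausible reading of the aggregation step.

```python
import statistics

def sliding_windows(values, size, overlap=0.5):
    """Split a per-frame signal into fixed-size windows with fractional overlap."""
    step = max(1, int(size * (1 - overlap)))
    return [values[s:s + size] for s in range(0, len(values) - size + 1, step)]

def variability_stats(values):
    """The six statistics named in the pipeline: mean, std, CV, min, max, range."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lo, hi = min(values), max(values)
    return {
        "mean": mean,
        "std": std,
        "cv": std / mean if mean != 0 else float("nan"),
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }

def multiscale_features(signal, fps, scales_s=(5, 10, 30)):
    """Per-patient descriptors for one low-level feature across all window scales."""
    features = {}
    for seconds in scales_s:
        windows = sliding_windows(signal, int(seconds * fps))
        if not windows:
            continue  # recording shorter than this window size
        # Assumption: each window is summarized by its mean value.
        window_means = [statistics.fmean(w) for w in windows]
        for name, value in variability_stats(window_means).items():
            features[f"{seconds}s_{name}"] = value
    return features
```

For stages (6) and (7), a typical realization would combine scikit-learn's `SelectKBest(f_classif, k=20)` and `RandomForestClassifier` under `LeaveOneOut`; whether the project used exactly these classes is not stated in the report.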
8.17.3 Childhood Trauma Affects Speech and Language Measures in Patients with Major Depressive Disorder during Clinical Interviews
Speech analysis has shown significant promise as a potential biomarker for depression. However, no studies to date have examined the impact of childhood trauma on speech and language patterns in individuals with depression 32. This study aims to explore the relationship between vocal characteristics and depressive symptoms, while also assessing how childhood trauma may shape these patterns. 27 participants with a major depressive episode were included. The severity of depression was assessed using the Montgomery & Asberg Depression Rating Scale (MADRS) and the Beck Depression Inventory II. Childhood trauma was measured using the Childhood Trauma Questionnaire. Speech recordings from the MADRS semi-structured interview and a free clinical interview were analyzed using speaker diarization, automatic speech recognition, and feature extraction.
Several acoustic features were significantly associated with depression severity. Correlation analysis revealed that greater depression severity was linked to shorter, less diverse speech, characterized by fewer words, fewer semantic clusters, and reduced articulatory effort. In contrast, childhood trauma was positively associated with distinct speech characteristics. Higher trauma load was associated with richer, longer, and more syntactically complex speech. Additionally, utterances were shorter, with more frequent shifts between semantic clusters, reflecting a more fragmented speech pattern influenced by traumatic load. Our study highlights the influence of childhood trauma on the vocal and linguistic characteristics of patients with depression. Automated language analysis offers the possibility of identifying biomarkers of traumatic load in these patients, which could improve diagnostic accuracy, guide therapeutic management, and help monitor clinical progress.
8.18 MultiMediate'25: Cross-Cultural Multi-domain Engagement Estimation
Participants: Michal Balazia, Francois Bremond.
Estimating momentary conversational engagement is central to assistive, socially aware AI systems, yet models are typically trained and evaluated within a single domain, limiting real-world robustness. The MultiMediate'25 challenge 47 advances engagement estimation to more challenging, cross-cultural, and multi-domain settings. Building on prior challenge editions, we no longer rely on NOXI and MPIIGroupInteraction (see Figure 20) as the sole training sources: we introduce NOXI-J, a new multilingual corpus covering Japanese and Chinese interactions, enabling both training and evaluation in diverse linguistic contexts. Although NOXI-J conceptually extends NOXI, we treat it as a distinct domain because linguistic, cultural, capture, and annotation differences induce measurable distribution shifts. MultiMediate'25 continues all previously defined tasks and adds a new one: Cross-cultural Multi-domain Engagement Estimation.
In this work, we present new annotations, precomputed multi-modal features (visual, vocal, and verbal), baseline evaluations, and an analysis of the best performing challenge solutions. Beyond accuracy, we quantify fairness using conditional demographic disparity for gender and language. Our baselines confirm strong in-domain performance (e.g., paralinguistic eGeMAPS and video-transformer features) and reveal notable cross-domain drops, underscoring the challenge of cultural, linguistic, and interactional shifts. Fairness analyses indicate generally small discrepancies for our baselines. We observe the largest disparities for the proposed challenge solutions on the Chinese language test set. All annotations, features, code, and leaderboards are made publicly available to foster sustained progress on robust and fair engagement estimation.
The image consists of three side-by-side photos of a man in different poses. In the first photo, he is talking on a phone, in the second he is standing relaxed, and in the third he is gesturing with his hands. Below each photo, there is a graphical representation indicating some form of measured data, possibly volume or sound levels. Each graph has a red vertical line and a yellow-shaded area with varying black line patterns. (Description generated at January 14th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
As training datasets, we provide NOXI and NOXI-J to our participants. NOXI is a corpus of dyadic, screen-mediated face-to-face interactions in an expert-novice knowledge sharing context. In a session, one participant assumes the role of an expert and the other participant the role of a novice. NOXI includes interactions recorded at three locations (France, Germany and UK), spoken in seven languages (English, French, German, Spanish, Indonesian, Arabic and Italian), discussing a wide range of topics. The languages Indonesian, Arabic, Spanish, and Italian serve as an out-of-domain evaluation set. NOXI is extended by NOXI-J consisting of 66 dyadic interactions and over 16 hours of material using the same setup as original NOXI. NOXI-J features 48 interactions in Japanese with native Japanese speakers and 18 interactions in Chinese with Chinese native speakers. See Table 1 for the train-validation-test split.
| Training Data | Validation Data | Test Data |
| NOXI | NOXI | NOXI |
| English (23), French (7), German (8) | English (3), French (4), German (3) | English (6), French (6), German (4) |
| NOXI (additional test languages) | ||
| Arabic (2), Italian (2), Indonesian (4), Spanish (4) | ||
| MPIIGroupInteraction | MPIIGroupInteraction | |
| German (6) | German (6) | |
| NOXI-J | NOXI-J | NOXI-J |
| Japanese (21), Chinese (10) | Japanese (6), Chinese (4) | Japanese (6), Chinese (4) |
The task is frame-wise prediction of each interlocutor's engagement on a continuous scale. Accuracy is measured with the Concordance Correlation Coefficient (CCC), which ranges from -1 to 1. Participants are free to use the provided labeled data for training and validation and undergo in-domain and out-of-domain evaluations on NOXI, NOXI-J, NOXI (additional languages), and MPIIGroupInteraction. We provide a multi-modal set of precomputed features to participants. From the audio signal, we provide transcripts generated with the Whisper model. Additionally, we supply GeMAPS features along with wav2vec 2.0 embeddings. From the video, we provide the backbone embeddings of Video Swin Transformer, DINOv2, CLIP and VideoMAEv2 and the outputs of OpenFace and OpenPose to cover facial as well as body behaviors.
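For reference, the CCC between a predicted and an annotated engagement sequence can be computed as in the sketch below. It uses the standard definition with population (biased) variances; the function name is ours, and the official challenge evaluation script may differ in details such as how sequences are concatenated.

```python
def concordance_correlation(pred, target):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov / (var_pred + var_target + (mean_pred - mean_target) ** 2)
    """
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vt = sum((t - mt) ** 2 for t in target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / n
    denom = vp + vt + (mp - mt) ** 2
    if denom == 0:
        return float("nan")  # both sequences constant and identical
    return 2 * cov / denom
```

Unlike Pearson's correlation, CCC penalizes systematic offsets: predictions `[0, 1, 2]` against targets `[1, 2, 3]` are perfectly correlated, yet their CCC is only 4/7.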
8.19 Stress Estimation in Dancers for Injury Prevention
Participants: Dian-Wei Lai, Quentin Merilleau, Aowen Shi, Francois Bremond.
Detecting stress in dancers is important, as high stress levels are often related to fatigue and injuries, which can negatively affect both performance and health. However, stress detection itself is not an easy task. This becomes even more challenging when using indirect and non-invasive data such as video. Although video is one of the most commonly available modalities, extracting reliable stress information from it remains highly challenging.
In this work, we investigate automatic stress estimation from dance videos using a small, weakly labeled dataset collected from professional dancers at Université Côte d’Azur. Each dancer performs the same dance under three different difficulty levels and in different scenes. The dataset currently includes 84 dancers, with two camera views (front view and diagonal view). Each video is approximately 1 to 2 minutes long. Data collection is still ongoing to further enrich the dataset and improve the reliability of the stress score distributions (see the PDF and CDF curves in Figure 21).
The image shows a study with 84 dancers performing three exercises of varying difficulty: Easy (Exercise 1), Intermediate (Exercise 2), and Hard (Exercise 3). The right side displays histograms and cumulative distribution functions (CDFs) of stress levels assessed by judges for each exercise. The histograms show the distribution of stress scores, while the CDFs illustrate the cumulative probabilities of these scores. Stress levels appear to increase with the difficulty of the exercise. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
To obtain meaningful results with limited data, we leverage pretrained models trained on large-scale video and motion datasets to improve feature representations. We then study the contribution of different visual modalities, including RGB, skeleton poses extracted using different methods with richer joint (see Figure 22) and hand motion information, depth, and optical flow. By analyzing each modality separately and in combination, we aim to build a robust multi-modal pipeline for stress estimation and to identify which modalities and movement cues are most informative for effective stress prediction.
The image contains a series of four photos showing a person in a dynamic pose, with different colored lines and dots overlaid on the person's body. These lines and dots appear to represent a pose estimation or skeletal tracking system, mapping key points such as the head, shoulders, elbows, wrists, hips, knees, and ankles. The person is dressed in a long-sleeved shirt and pants. The background shows a room with a partially visible doorway and paintings on the wall. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.20 Emotion recognition using Deep Learning
Participants: Valeriya Strizhkova, Antitza Dantcheva, Francois Bremond.
Understanding human emotions is crucial in healthcare, human-robot interaction, and marketing. Despite progress in emotion recognition from a single modality, such as facial video or a sequence of physiological signals, improving performance by combining multiple modalities remains challenging. Moreover, it is difficult to recognize emotions in long sequential data, such as long videos, although most real-world videos of people expressing emotions are long. Existing emotion datasets are limited in volume and quality, making it difficult to develop an effective deep learning-based emotion recognition system. An effective real-world emotion understanding system should be able to recognize emotions from long videos synchronized with multiple modalities. In this thesis 52, we focus on multimodal emotion recognition from long videos synchronized with physiological signals. Specifically, multimodal emotion recognition methods face three main challenges: (a) learning the emotion representation, (b) learning the representation of fine-grained emotions, and (c) combining modalities to predict emotions. We first introduce two large behavior analysis datasets: INEMO and StressID. INEMO is a multimodal dataset designed to facilitate emotion recognition in people watching social media videos. StressID is a multimodal dataset designed for stress identification. Secondly, we propose two pre-training techniques for facial expression recognition: (1) supervised pre-training on synthetic data generated by our video generation method and (2) self-supervised pre-training on multi-view videos. We show that these pre-training techniques yield rich facial representations, which in turn improve fine-grained emotion recognition accuracy. Thirdly, we tackle the problem of emotion recognition from multiple modalities. We propose a framework for multimodal fusion of videos and physiological signals to predict emotions. This framework consists of two main steps: (1) extracting features from long raw videos and physiological signals; (2) fusing the extracted features to predict emotions using a cross-modality approach based on an attention mechanism. Our methods leverage the additional modalities, resulting in better emotion recognition performance. They have been extensively evaluated on various emotion recognition benchmarks and outperform previous methods, bringing emotion recognition closer to real-world deployment.
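The attention-based cross-modality fusion step can be illustrated with generic scaled dot-product cross-attention, in which tokens of one modality (here, video) attend to tokens of the other (physiological signals). This is only a sketch of the general mechanism: the thesis does not specify the architecture at this level, there are no learned projections here, and all names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value pairs.

    queries: video-token features, shape [Nq][d]
    keys, values: physiological-token features, shapes [Nk][d] and [Nk][dv]
    Returns one fused vector of length dv per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused
```

A query that is orthogonal to every key receives uniform attention weights, so its fused vector is simply the mean of the value vectors.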
8.21 Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network
Participants: Sanya Sinha, Michal Balazia, Francois Bremond.
Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. In 44, we present a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOv9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing mean Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores.
The Go-ELAN YOLOv9 architecture (see Figure 23) contains an auxiliary block that implements the Programmable Gradient Information (PGI) concept: it creates an auxiliary reverse branch that enables reliable gradient calculation by avoiding potential semantic loss. The GELAN block in the backbone feature extractor is replaced by the Go-ELAN block proposed in this work. The Spatial Pyramid Pooling block SPPELAN removes the fixed-size limitation of the backbone. The ADown block downsamples the generated feature maps to target sizes. The CBLinear blocks extract higher-level features from the images, and the CBFuse block fuses these extracted features. The Neck combines the acquired features and the Head predicts the final bounding box outputs with their respective probabilities.
The image depicts a neural network architecture with sections labeled Auxiliary, Backbone, Neck, and Head. The Backbone starts with a Silence layer, followed by convolutional (Conv) layers, multiple Go-ELAN blocks, and ADown layers, ending with an SPPELAN block. The Auxiliary section mirrors parts of the Backbone with convolutional layers, Go-ELAN blocks, and ADown layers, and includes CBFuse and CBLinear blocks connecting to the Backbone. Both the Auxiliary and Backbone sections feed into the Neck, which connects to the final Head section, consisting of Detect blocks for making predictions. Connections between blocks are indicated by arrows showing data flow. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
Our Go-ELAN YOLOv9 model, evaluated against YOLO v5, v7, v8, v9 vanilla, Laptool and DETR, achieves a superior mAP of 73.74 at IoU 0.5 on a dataset of 615 images with 10 instrument classes, demonstrating the effectiveness of the proposed model. To illustrate the visual and qualitative superiority of our model, we have compared 12 ground-truth images with their respective model predictions in Figure 24.
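The IoU threshold underlying the reported mAP can be made concrete with a small sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates and illustrates the generic criterion, not the evaluation code used in 44.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection counts as correct if the class matches and IoU >= threshold."""
    return pred_class == gt_class and box_iou(pred_box, gt_box) >= threshold
```

At IoU 0.5, a predicted instrument box must overlap its ground-truth box by at least half of their union to count toward average precision for that class; mAP then averages the per-class average precision over the 10 instrument classes.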
The image shows a collection of medical procedure photos, likely from eye surgeries. Each image is labeled with various surgical tools such as speculum, cannula, forceps, hook, phacoprobe, and keratome. The tools are highlighted with colored boxes and labels, indicating their position and type. The images are arranged in a grid format, displaying different stages of the procedure. The eye is opened with a speculum, and various tools are used for precise surgical actions. The photos include confidence levels for the identified tools. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.22 TBDM: Temporal Boundary Distillation Module for Surgical Gesture Segmentation
This work was funded by 3IA Côte d'Azur.
Participants: Ezem Sura Ekmekci, Snehashis Majhi, Khodor Hamadi, Francois Bremond.
In 2025, in collaboration with CHU Nice and Caranx Medical, a novel framework for surgical gesture segmentation was developed that addresses the challenging problem of precise temporal localization during surgical action transitions. This work introduces the Temporal Boundary Distillation Module (TBDM), an innovative approach that explicitly models temporal boundaries between surgical gestures using RGB-only video data (see Figure 25). The framework employs knowledge distillation to learn boundary-aware features during training through cross-attention mechanisms, while requiring no additional computational overhead at inference. TBDM was validated on two major surgical datasets (CholecT50 and RARP-45), demonstrating consistent improvements across multiple baseline architectures, with up to +8.5 edit score improvement on CholecT50. Notably, the approach achieved state-of-the-art performance on RARP-45 (81.4 edit score, 77.9 F1@50), establishing TBDM as a generalizable, plug-and-play solution for fine-grained surgical workflow analysis. This work has been submitted to IPCAI 2026.
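The edit score reported above is commonly defined as a normalized Levenshtein distance between the predicted and ground-truth sequences of segment labels, which rewards correct gesture ordering and penalizes over-segmentation. The sketch below follows that common definition; it is an assumption that the evaluation in this work uses exactly this variant.

```python
def segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    collapsed = []
    for label in frame_labels:
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)
    return collapsed

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; 100 means identical segment order."""
    p, g = segments(pred_frames), segments(gt_frames)
    if not p and not g:
        return 100.0
    return 100.0 * (1 - levenshtein(p, g) / max(len(p), len(g)))
```

For example, a prediction that briefly flickers between two gestures (`AABABBCC` against `AABBBBCC`) keeps high frame accuracy but drops to an edit score of 60, which is why boundary-aware modeling matters for this metric.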
Additionally, a comprehensive evaluation of YOLOv8 for real-time surgical instrument recognition in robot-assisted and laparoscopic surgeries was conducted 33. Using a diverse multi-source dataset of over 7,400 frames and 17,175 annotations, the model achieved a mean average precision of 0.77 for binary detection and 0.72 for multi-instrument classification across seven instrument types. The segmentation performance demonstrated excellent accuracy with a mean Dice score of 0.91 and mean intersection over union of 0.86. With an inference speed of 1.12 milliseconds per frame, the model shows strong potential for real-time clinical applications in surgical workflow analysis and instrument tracking.
The image depicts a machine learning framework for gesture recognition. The process starts with a sequence of video frames input into a pre-trained VideoMAE-v2 model with frozen parameters. The extracted features are then passed to a projection layer and a temporal model for prediction. Additionally, there is a temporal boundary distillation module that only operates during training. This module uses cross-attention mechanisms and class presence maps to aggregate gesture class information. This module helps in refining the model's decision boundaries through a distillation loss calculated between the projection features and the distilled boundary features. The framework aims to improve gesture recognition accuracy by leveraging temporal information and class distinctions. (Description generated at January 19th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.23 Effective Video Feature Extraction for Training and Comprehension: Human-Centered Multimodal Video
Participants: Tanay Agrawal, Antitza Dantcheva, Francois Bremond.
Understanding actions in videos is a crucial element of computer vision, with significant implications in many fields. Given our increasing reliance on visual data, understanding and interpreting human actions in videos are becoming essential for developing technologies in surveillance, healthcare, autonomous systems, and human-computer interaction. Accurate interpretation of actions in videos is fundamental to creating intelligent systems capable of navigating and responding effectively to the complexities of the real world. In this context, advances in action understanding are pushing the boundaries of computer vision and playing a crucial role in the development of cutting-edge applications that impact our daily lives.
Computer vision has seen significant progress thanks to the rise of deep learning methods such as convolutional neural networks (CNNs) and transformers, enabling the community to advance in many areas, including image segmentation, object detection, and scene understanding. However, video processing remains limited compared to static images. In this thesis, we focus on video understanding, dividing it into two main parts: video classification and action detection, and their application in affective computing, particularly in interaction-based scenarios. We explore efficient learning approaches for video feature extraction in various video classification and interaction understanding tasks. Our contributions 51 cover the computation of intermediate-level features for faster convergence, plugin adaptation for handling diverse datasets and modalities, and evolutionary temporal modeling for understanding long videos. We begin by improving personality and behavior recognition through geometry-based behavioral coding and segmentation-driven attention mechanisms. We then address the challenges of modality availability and data diversity using knowledge distillation and a novel adapter-based cross-learning framework that generalizes to all tasks. Finally, we tackle the analysis of long videos for temporal action detection using temporal adapters with image models, as well as modular adapters and a two-stage spatiotemporal learning strategy with a video basis. Together, this work contributes to building generalizable and efficient learning systems for a wide range of video understanding applications.
8.24 Rotation-Induced Centroid Shift in Latent Space
Participants: Benoit Lagadec, Matthieu Saumard, Francois Bremond.
Convolutional neural networks are not rotation-equivariant in practice: discrete image rotation requires interpolation and zero-padding, making the rotation operator non-invertible and causing convolution and rotation to not commute. We formalize discrete in-plane rotation on pixel grids as a degraded permutation (see Figure 26 for an illustration) and show why convolution and rotation do not commute. This leads to a systematic and measurable shift of the feature-space centroid when images are rotated, even when the model is trained with standard rotation augmentation. We formalize this centroid drift analytically and verify it empirically. To mitigate this effect, we introduce a set of angle-specialized Exponential Moving Average (EMA) teachers that provide stable feature anchors at different rotation angles, optionally enhanced with low-rank angle adapters. This approach directly suppresses rotation-induced centroid shift and significantly improves feature consistency and classification accuracy under rotation, outperforming both classical augmentation and mean-teacher baselines while requiring minimal additional computation. Although many studies are dedicated to finding rotation invariants for detection, the centroid shift itself has received little explicit attention.
This image illustrates a series of transformations applied to a grid of pixels. 1. **Original Grid**: A small grid with green highlighted pixels. 2. **Add Padding**: The grid is expanded with black padding around it. 3. **Rotation 45 Degrees**: The grid is rotated by 45 degrees within the padded area. 4. **Real/Numerical Transformation**: The rotated grid is transformed into a new format while retaining its pixel structure. 5. **Flatten Operation (Vectorization)**: Both the original and transformed grids are flattened into a linear vector format. 6. **Re-ordering**: Pixels are re-ordered, with some pixels not affected by the transformation. 7. **Limited Permutation**: The vectors are permuted to demonstrate how a rotation translates to a simple permutation of pixels. 8. **Final Sub-Space**: After keeping only specific pixels, a sub-space of the image is shown, illustrating that rotation can be represented by a simple permutation of pixels. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
Convolutional neural networks are often assumed to be robust to rotations when trained with rotation augmentations. However, this assumption overlooks a key property of real image rotations: discrete rotation on a pixel grid is implemented using interpolation and padding, making the operation non-invertible and causing it to not commute with convolution. As a result, rotating an input image and then extracting features is not equivalent to extracting features and then rotating them. We show that this mismatch induces a systematic and predictable shift in the feature-space centroid across rotation angles, even when the network is trained with extensive rotation augmentation.
This observation reframes rotation robustness as a problem of representation geometry, rather than data diversity alone. If rotation induces angle-dependent sub-clusters in feature space, enforcing global consistency (e.g., with a single Mean Teacher model) can suppress meaningful structure and lead to underfitting. We therefore propose a simple alternative: a set of angle-specialized EMA teachers that provide stable feature targets at different rotation angles, coupled with a feature-space centroid alignment loss that prevents rotation-induced drift without collapsing intra-class variability. Our approach is architecture-agnostic, computationally lightweight, and complementary to standard training pipelines. It improves rotation robustness in both classification and detection settings without requiring group-equivariant architectures or spatial transformer modules. The core contribution of this work is to characterize the geometric effect of discrete rotation in CNN feature space and to introduce a training strategy that explicitly stabilizes this geometry.
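The two ingredients named above, per-angle EMA teachers and a measure of centroid drift, can be sketched in a few lines. This is a schematic illustration only: weights are flat lists of floats, features are plain vectors, and the class and function names are hypothetical rather than taken from the authors' code.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Exponential moving average: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

def centroid(features):
    """Mean feature vector of a batch (list of equal-length vectors)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def centroid_shift(features_ref, features_rot):
    """Euclidean distance between the centroids of unrotated and rotated features."""
    return math.dist(centroid(features_ref), centroid(features_rot))

class AngleTeachers:
    """One EMA teacher per rotation angle instead of a single mean teacher."""

    def __init__(self, angles, init_weights, decay=0.99):
        self.decay = decay
        self.weights = {angle: list(init_weights) for angle in angles}

    def update(self, angle, student_weights):
        # Only the teacher matching the current batch's rotation angle is updated.
        self.weights[angle] = ema_update(self.weights[angle], student_weights, self.decay)
```

In such a setup, a centroid alignment term could penalize `centroid_shift` between student features at a given angle and the matching teacher's features, which constrains the mean of each angle's sub-cluster without forcing individual samples to collapse onto one another.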
Applied to detection (see Figure 26), our method ensures that rotation-induced feature sub-clusters remain compact and aligned. This contrasts with our former work, which uses a related mechanism in person re-identification to enlarge inter-cluster separation, whereas our objective is to preserve sub-cluster coherence.
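The two ingredients named above, angle-specialized EMA teachers and a centroid alignment loss, can be illustrated with a minimal sketch. It is not the team's implementation: the angle set, the momentum value, the flat weight vectors, and the squared-distance form of the loss are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def ema_update(teacher: Vector, student: Vector, momentum: float = 0.99) -> Vector:
    """Blend the student weights into a teacher with an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def centroid(features: List[Vector]) -> Vector:
    """Mean feature vector of a batch."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def centroid_alignment_loss(student_feats: List[Vector], teacher_feats: List[Vector]) -> float:
    """Squared distance between the student and teacher batch centroids (assumed form)."""
    cs, ct = centroid(student_feats), centroid(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(cs, ct))


# One teacher per rotation angle; the angle set here is hypothetical.
angles = [0, 90, 180, 270]
student_weights = [1.0, 2.0]
teachers: Dict[int, Vector] = {angle: [0.0, 0.0] for angle in angles}

# Only the teacher matching the rotation angle of the current batch is updated.
teachers[90] = ema_update(teachers[90], student_weights, momentum=0.9)
```

Because the loss compares centroids only, two batches with the same mean but different spread incur no penalty, which is one simple way to constrain drift without collapsing intra-class variability.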
8.25 Dual Volume Skeleton-Guided 3D Face Reconstruction from Sparse Views
Participants: Benoit Lagadec, Seongro Yoon, François Brémond.
Reconstructing high-fidelity 3D face meshes from sparse 2D inputs is challenging due to limited depth cues and structural ambiguity. We present a skeleton-guided, dual-volume diffusion framework for reconstructing editable, high-fidelity 3D face meshes from only two sparse views, see Figure 27. By integrating part-level latent diffusion with skeleton-based conditioning and symmetry-aware dual-volume packing, our approach preserves pose-consistent geometry, enables part-aware editing, and maintains bilateral alignment. A teacher–student strategy with multi-view consistency further improves stability and fidelity, yielding significant gains over state-of-the-art baselines. Our contributions:
• A skeleton-conditioned diffusion pipeline that injects explicit structural priors to improve pose-consistent geometry under sparse views.
• A dual-volume latent representation, inspired by bipartite packing, enabling part-aware decoding and preventing fusion of contacting parts. It also allows a partial view to be completed in the final face generation.
• A symmetry-aware objective coupling reconstruction accuracy and bilateral regularization for realistic midline geometry.
• A self-supervised teacher–student strategy that enhances multi-view consistency.
The image depicts a process for generating a 3D mesh from a 2D input image of a person's face. The process begins with facial landmark detection. These key points are used to construct a 3D skeleton, which is encoded into a clip. Using a dual U-Net-based diffusion model with skeleton conditioning, two 3D single view generations are created. These views are decoded into volumes (left and right) and combined using marching cubes with part mesh to produce the final 3D mesh. Loss functions like mutual contrastive loss and symmetry losses are applied during the process to ensure accuracy and symmetry. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
Given two input images (frontal and profile), we detect 2D landmarks to form a facial skeleton. The skeleton, rather than the raw landmarks, conditions the diffusion model, which yields more realistic generations. A skeleton encoder produces a latent embedding that conditions a dual-UNet diffusion backbone via adaptive normalization. The denoiser outputs two latent volumes (left/right), which are decoded by a 3D VAE into SDF (signed distance function) or occupancy grids. Marching Cubes extracts a mesh per side; parts remain disjoint thanks to dual-volume packing, see Figure 28. A symmetry loss regularizes left/right consistency. The complete architecture is shown in Figure 27.
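As an illustration of the symmetry regularization mentioned above, the sketch below mirrors one decoded volume across the midline and measures its mean squared difference from the other. The `volume[z][y][x]` layout, the choice of the x-axis as the bilateral axis, and the squared-error form are assumptions; the actual objective may differ.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as volume[z][y][x] (assumed layout)


def mirror_x(volume: Volume) -> Volume:
    """Reflect a volume across the midline by reversing the x-axis."""
    return [[row[::-1] for row in plane] for plane in volume]


def symmetry_loss(left: Volume, right: Volume) -> float:
    """Mean squared difference between the left volume and the mirrored right volume."""
    mirrored = mirror_x(right)
    total, count = 0.0, 0
    for plane_l, plane_r in zip(left, mirrored):
        for row_l, row_r in zip(plane_l, plane_r):
            for a, b in zip(row_l, row_r):
                total += (a - b) ** 2
                count += 1
    return total / count


# A tiny 1x2x2 volume and its exact mirror image give zero loss.
left_volume = [[[1.0, 2.0], [3.0, 4.0]]]
right_volume = mirror_x(left_volume)
perfect = symmetry_loss(left_volume, right_volume)
```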
The image displays three 3D models of a human head and shoulders. The first model is entirely red with a hat. The second model is entirely blue with a hat. The third model has the head in red and the shoulders and upper body in blue. All models are set against a grey grid background. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.26 Turbo Learning: 3D Face Reconstruction with Mesh Re-Projection and Re-Identification Consistency
Participants: Benoit Lagadec, François Brémond.
We introduce Turbo Learning, a two-stage iterative refinement framework for 3D face reconstruction inspired by the positive-feedback dynamics of turbocharged engines. Traditional pipelines rely on sparse supervisory cues such as 2D landmarks, limiting their ability to recover accurate geometry. Our approach instead uses self-generated 3D meshes as progressively stronger priors: Stage 1 predicts a coarse mesh guided by MediaPipe landmarks, while Stage 2 uses this mesh as dense geometric supervision.
To further enhance identity preservation, we introduce a Mesh Re-Projection and Re-Identification Consistency Loss. By re-projecting meshes from both stages into image space and applying an InfoNCE contrastive Re-ID objective, we enforce identity stability across refinement steps. The combination of a geometric turbo loop and an identity turbo loop produces reconstructions that are more stable, more detailed, and more identity-faithful.
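A minimal sketch of an InfoNCE objective of the kind described above is given below. It assumes that each re-projected mesh has already been mapped to a non-zero identity embedding by some Re-ID encoder (not shown), that row `i` of both batches belongs to the same identity, and that cosine similarity with a temperature of 0.1 is used; these choices are illustrative, not taken from the report.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(stage1: List[Vector], stage2: List[Vector], temperature: float = 0.1) -> float:
    """InfoNCE over a batch: embedding i of stage 1 should match embedding i of stage 2."""
    losses = []
    for i, anchor in enumerate(stage1):
        logits = [cosine(anchor, other) / temperature for other in stage2]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Embeddings that agree across stages give a much lower loss than swapped identities.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```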
We compare Turbo Learning to classical iterative strategies such as EM, diffusion-based refinement, boosting, and teacher–student systems, and show that it occupies a distinctive position among them, see Fig. 29.
The image depicts a two-step process for a machine learning framework aimed at improving 3D mesh reconstruction. In Step 1, the input includes face and profile images, generating depth, mask, and normal outputs. These are processed through a dual uncertainty block and re-projected to calculate re-projection loss and re-ID contrastive loss. Step 2 builds on this by refining the 3D mesh guided by the initial output and additional ground truth mesh, further minimizing re-projection loss. The dual uncertainty block plays a central role in both steps, ensuring accurate depth and geometric information. (Description generated at January 15th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
8.27 THEval: Evaluation Framework for Talking Head Video Generation
Participants: Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva.
Generative models for talking head videos have witnessed remarkable progress, achieving high-resolution and realistic results. However, evaluating these models remains a significant challenge, as the rapid advancement in generation has outpaced the development of adequate metrics. Current evaluations primarily rely on general image quality metrics or lip-synchronization scores, which often fail to capture essential aspects of realism such as motion quality, temporal coherence, and naturalness. Furthermore, these existing metrics have been shown to correlate poorly with human preferences, necessitating a more robust and perceptually aligned evaluation approach.
Overview of the THEval scheme.
We introduce THEVAL 56, a novel evaluation framework designed to address these limitations by aligning closely with human perception. A visual summary of the framework is shown in Figure 30. We support this framework with a new, challenging evaluation dataset comprising over 5,000 videos sourced from diverse YouTube channels, ensuring the content was unseen during model training to test generalization. The dataset features a wide range of languages, head poses, and expressions. To assess performance comprehensively, we decompose evaluation into three core dimensions: quality, naturalness, and synchronization, utilizing eight fine-grained metrics to analyze dynamics such as lip and head motion alongside global aesthetics.
To validate our framework, we conduct an extensive benchmark of 17 state-of-the-art audio- and video-driven models, generating and analyzing over 85,000 videos. We leverage a user study to demonstrate that our final composite score achieves a strong Spearman correlation of 0.870 with human ratings, significantly outperforming traditional metrics like FID and Syncnet. By applying this pipeline, we identify that while many current algorithms excel in lip synchronization, they continue to face challenges in generating expressive facial behavior and artifact-free details, establishing THEVAL as a vital tool for fostering future progress in the field.
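For readers unfamiliar with the statistic, the sketch below shows how a Spearman correlation between metric scores and human ratings can be computed: both lists are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is returned. The input values are made up for illustration and are unrelated to the reported 0.870.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical composite scores and mean human ratings for four models.
composite_scores = [0.61, 0.74, 0.55, 0.90]
human_ratings = [3.1, 3.9, 2.4, 4.6]
rho = spearman(composite_scores, human_ratings)
```

Spearman correlation depends only on the ordering of the models, so any monotone rescaling of the composite score leaves it unchanged.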
8.28 Beyond Real versus Fake Towards Intent-Aware Video Analysis
Participants: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva.
The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus primarily on distinguishing real from fake videos, such approaches fail to address a fundamental question regarding the intent behind manipulated content. With the proliferation of AI-generated media, the binary distinction of authenticity is becoming less relevant than understanding whether content is malicious or benign. This shift necessitates a new paradigm in video analysis that moves beyond artifact detection to the contextual understanding of underlying motivations.
Three-Way Contrastive Alignment Pipeline
We introduce IntentHQ 53, a new benchmark for human-centered intent analysis designed to formalize the task of intent recognition. We curate a comprehensive dataset of 5,168 videos, meticulously annotated with 23 fine-grained intent categories such as "Financial fraud", "Political propaganda", and "Comedy", organized under five broader dimensions including Deception and Persuasion. To effectively analyze these videos, we propose a novel self-supervised learning framework (see Figure 31) that leverages a three-way contrastive alignment strategy. This method jointly aligns video, audio, and textual modalities, utilizing data augmentation techniques like semantic paraphrasing and text-to-speech synthesis to learn robust representations without relying on manual labels during pretraining.
To validate our approach, we benchmark intent recognition using various state-of-the-art multimodal architectures. Our proposed model, which integrates spatio-temporal video features with audio and text analysis, achieves a classification accuracy of 52.5%, establishing a new state-of-the-art by significantly outperforming standard video classification baselines such as VideoMAE and TimeSFormer. Ablation studies further reveal that, while video remains the most predictive modality, the fusion of text and audio is essential for distinguishing complex, socially embedded intents. By releasing the IntentHQ dataset and code, we aim to foster further research in intent-aware media analysis, shifting the focus towards a more nuanced understanding of digital content.
8.29 AI killed the video star. Audio-driven diffusion model for expressive talking head generation
Participants: Baptiste Chopin, Antitza Dantcheva.
We proposed Dimitra++ 55, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we proposed a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, and an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggested that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose. Code and qualitative results are provided on our project page.
8.30 LIA-X: Interpretable Latent Portrait Animator
Participants: Antitza Dantcheva, François Brémond.
We introduce LIA-X 57, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation. Project Page
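The idea of linear navigation in latent space can be sketched as follows: a latent code is moved along a small number of dictionary directions, each scaled by its own coefficient. The three-dimensional codes, the two direction vectors, and their informal labels are hypothetical; the actual LIA-X dictionary and its training are not reproduced here.

```python
from typing import List

Vector = List[float]


def navigate(source_code: Vector, dictionary: List[Vector], coefficients: Vector) -> Vector:
    """Move a latent code along dictionary directions: z + sum_i a_i * d_i."""
    target = list(source_code)
    for direction, amount in zip(dictionary, coefficients):
        for k, component in enumerate(direction):
            target[k] += amount * component
    return target


# Hypothetical dictionary with two motion directions (e.g. "yaw" and "mouth opening").
motion_dictionary = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
source = [0.5, 0.5, 0.5]

# Editing a single factor: only the first coefficient is non-zero.
edited = navigate(source, motion_dictionary, [2.0, 0.0])
```

Keeping most coefficients at zero means each edit touches a single factor, which is the kind of control an interpretable dictionary is meant to give.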
8.31 Simplicity-Bias-Aware Adaptation of Foundation Models for Deepfake Detection
Participants: Charbel Yahchouchi, Noemi Roggero, Laurent Saroul, Antitza Dantcheva.
Given the rapid advancement of deep learning and generative models, the synthesis of realistic and plausible images and videos has reached unprecedented levels. However, this accessibility also raises serious concerns, as such content can be misused for malicious purposes such as identity impersonation, misinformation, and social manipulation. Consequently, deepfake detection has emerged as a crucial area of research, aiming to develop robust and generalizable detectors capable of reliably identifying manipulated media. Despite impressive progress, most current detectors struggle to generalize to unseen manipulation, limiting their real-world reliability.
The image depicts a deep learning model designed for detecting fake or real images. The process starts by converting an input image into patch embeddings. These embeddings go through a series of Transformer and Adapter layers within the CLIP ViT backbone. The output is then split into two paths: one leading to an auxiliary classification head (SiFeR Head) and the other to the main classification head. Both heads use linear classifiers to determine if the image is fake or real. The auxiliary head calculates a specific loss function that includes both auxiliary and forget loss, while the main head calculates the main loss. The model aims to improve the detection of manipulated images by using these dual classification heads. (Description generated at January 14th, 2026 by Albert AI with the model Mistral-Small-3.2-24B)
In this work, we study the limitations of foundation model adaptation for deepfake detection under distribution shifts and address the impact of shortcut learning induced by parameter-efficient fine-tuning for deepfakes. We introduce a simplicity-bias-aware adaptation framework, see Fig. 32, that augments a frozen CLIP visual encoder with lightweight adapter modules and integrates the SIFER feature-sieving mechanism to identify and suppress simple but non-generalizable cues during training. To validate our framework, we conduct an extensive evaluation on recent state-of-the-art deepfake detection datasets, focusing on cross-dataset and cross-manipulation generalization under distribution shifts. Experimental results show consistent improvements in video-level Area Under the Curve (AUC) compared to CLIP-based baselines and other parameter-efficient adaptation strategies, with particularly strong gains on subtle and localized manipulations.
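To make the evaluation metric concrete, the sketch below computes a video-level AUC from per-frame scores. Averaging frame scores into one score per video is an assumption (other aggregation rules exist), and the scores and labels are made up for illustration.

```python
from typing import Dict, List


def video_scores(frame_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average per-frame fake probabilities into one score per video (assumed aggregation)."""
    return {video: sum(scores) / len(scores) for video, scores in frame_scores.items()}


def auc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random fake (label 1) outscores a random real (label 0); ties count 0.5."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0 for f in fakes for r in reals)
    return wins / (len(fakes) * len(reals))


# Hypothetical per-frame detector outputs for four videos.
frames = {
    "fake_a": [0.9, 0.7],
    "real_b": [0.2, 0.4],
    "fake_c": [0.6, 0.6],
    "real_d": [0.1, 0.9],
}
labels = {"fake_a": 1, "real_b": 0, "fake_c": 1, "real_d": 0}

per_video = video_scores(frames)
names = sorted(per_video)
video_auc = auc([per_video[n] for n in names], [labels[n] for n in names])
```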
8.32 Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos
Participants: Anil Egin, Antitza Dantcheva.
Face video anonymization aims at preserving privacy while still allowing videos to be analyzed in a number of downstream computer vision tasks such as expression recognition, people tracking, and action recognition. We propose a novel unified framework 39, referred to as Anon-NET, streamlined to de-identify facial videos while preserving the age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces with a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate the de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of Anon-NET in obfuscating identity while retaining visual realism and temporal consistency. Project Page
8.33 Beyond the visible: A survey on cross-spectral face recognition
Participants: Antitza Dantcheva.
Cross-spectral face recognition (CFR) refers to recognizing individuals using face images stemming from different spectral bands, such as infrared versus visible. While CFR is inherently more challenging than classical face recognition due to significant variation in facial appearance caused by the modality gap, it is useful in many scenarios including night-vision biometrics and detecting presentation attacks. Recent advances in deep neural networks (DNNs) have resulted in significant improvement in the performance of CFR systems. Given these developments, the contributions of this survey are three-fold. First, we provide an overview of CFR, by formalizing the CFR problem and presenting related applications. Secondly, we discuss the appropriate spectral bands for face recognition and discuss recent CFR methods, placing emphasis on deep neural networks. In particular, we describe techniques that have been proposed to extract and compare heterogeneous features emerging from different spectral bands. We also discuss the datasets that have been used for evaluating CFR methods. Finally, we discuss the challenges and future lines of research on this topic.
This work has been published in Neurocomputing 31.
9 Bilateral contracts and grants with industry
Participants: Antitza Dantcheva, François Brémond.
The Stars team currently has several technology-transfer collaborations with industry, which have made it possible to exploit research results.
9.1 Bilateral contracts with industry
9.1.1 Toyota
This project runs from the 1st of August 2013 up to December 2025. It aims at detecting critical situations in the daily life of older adults living alone at home.
Toyota is working with Stars on action recognition software to be integrated on their robot platform. Detecting these critical situations requires not only the recognition of ADLs (activities of daily living) but also an evaluation of the way and timing in which they are carried out. The system is intended to help older adults and their relatives feel more comfortable, knowing that potentially dangerous situations will be detected and reported to caregivers if necessary. It is designed to work with a Partner Robot (HSR), sending it real-time information so that it can better interact with the older adult.
9.1.2 Fantastic Sourcing
Fantastic Sourcing is a French SME specialized in micro-electronics that develops e-health technologies. Fantastic Sourcing is collaborating with Stars through the Univ. Côte d'Azur Solitaria project, by providing their Nodeus system. Nodeus is an IoT (Internet of Things) system for home support for the elderly, which consists of a set of small sensors (without video cameras) that collect valuable data on the habits of isolated people. The Solitaria project performs a multi-sensor activity analysis for the monitoring and safety of older and isolated people. With the ageing of the population in Europe and in the rest of the world, keeping elderly people at home, in their usual environment, for as long as possible is becoming a priority and a challenge of modern society. A system for monitoring activities and alerting in case of danger, in permanent connection with a device (an application on a phone, a surveillance system ...) to warn relatives (family, neighbors, friends ...) of isolated people still living in their natural environment, could save lives and avoid incidents that cause or worsen the loss of autonomy. In this R&D project, we propose to study a solution allowing the use of a set of innovative heterogeneous sensors in order to: 1) detect emergencies (falls, crises, etc.) and call relatives (neighbors, family, etc.); 2) detect, over short or longer predefined periods of time.
9.1.3 Probayes
STARS will be working with Probayes starting 01/07/2025 within a CIFRE Ph.D. on the development of advanced methods for detecting artificially generated videos using artificial intelligence models. Recent advances in image and video generation based on neural networks make it possible to create highly realistic fake videos of individuals (deepfakes), which raises major security concerns for many organizations. This project aims at designing innovative approaches to assess the authenticity of video content. A particular emphasis will be placed on developing techniques that are generalizable and not specific to a given video generation model or application context. The proposed methods will rely on the analysis of spatio-temporal behavioral cues, such as mouth dynamics, in order to evaluate the veracity of video sequences.
10 Partnerships and cooperations
Participants: François Brémond, Antitza Dantcheva, Michal Balazia, Monique Thonnat.
10.1 International research visitors
10.1.1 Visits of international scientists
Other international visits to the team
Participant: Donghyeon Cho.
- Status: Associate Professor
- Institution of origin: Ulsan National Institute of Science and Technology (UNIST)
- Country: South Korea
- Dates: July to August
- Context of the visit: Collaborations
- Mobility program/type of mobility: Korean research stay.
Participant: Jinsun Park.
- Status: Associate Professor
- Institution of origin: Pusan National University in Busan
- Country: South Korea
- Dates: July to August
- Context of the visit: Collaborations
- Mobility program/type of mobility: Korean research stay.
Participant: Seungryul Baek.
- Status: Associate Professor
- Institution of origin: Hanyang University in Seoul
- Country: South Korea
- Dates: July to August
- Context of the visit: Collaborations
- Mobility program/type of mobility: Korean research stay.
Participant: Eric Granger.
- Status: Full Professor
- Institution of origin: École de technologie supérieure, Université du Québec
- Country: Canada
- Dates: November to December
- Context of the visit: Collaborations
- Mobility program/type of mobility: sabbatical.
Participant: Nesli Erdogmus.
- Status: Assistant Professor
- Institution of origin: Izmir Institute of Technology
- Country: Turkey
- Dates: July to August
- Context of the visit: Collaborations
- Mobility program/type of mobility: Franco-Turkish Research Fellowship Program "Prestij"
10.2 European initiatives
10.2.1 Horizon Europe
GAIN
GAIN project on cordis.europa.eu
- Title: Georgian Artificial Intelligence Networking and Twinning Initiative
- Duration: From October 1, 2022 to September 30, 2025
- Partners:
  - Institut National De Recherche En Informatique Et Automatique (INRIA), France
  - Exolaunch GMBH (EXO), Germany
  - Deutsches Forschungszentrum Fur Kunstliche Intelligenz GMBH (DFKI), Germany
  - Georgian Technical University (GTU), Georgia
- Inria contact: François Brémond
- Coordinator: George Giorgobiani
- Summary:
GAIN will take a strategic step towards integrating Georgia, one of the Widening countries, into the system of European efforts aimed at ensuring Europe's leadership in one of the most transformative technologies of today and tomorrow: Artificial Intelligence (AI). This will be achieved by adjusting the research profile of the central Georgian ICT research institute, the Muskhelishvili Institute of Computational Mathematics (MICM), and linking it to the European AI research and innovation community. Two leading European research organizations (DFKI and INRIA), supported by the high-tech company EXOLAUNCH, will support MICM in this endeavor. The Strategic Research and Innovation Programme (SRIP) designed by the partnership will provide the environment for the Georgian colleagues to get involved in the research projects of the European partners addressing a clearly delineated set of AI topics. Jointly, the partners will advance in capacity building and networking within the area of AI Methods and Tools for Human Activities Recognition and Evaluation, which will also contribute to strengthening core competences in fundamental technologies such as Machine (Deep) Learning. The results of the cooperation, presented through a series of scientific publications and events, will inform the European AI community about the potential of MICM and trigger the building of new partnerships, addressing e.g. Horizon Europe. The project will contribute to the career development of a cohort of young researchers at MICM through joint supervision and targeted capacity-building measures. The Innovation and Research Administration and Management capacities of MICM will also be strengthened to allow the Institute to be better connected to local, regional and European innovation activities. Using their extensive research and innovation networking capacities, DFKI and INRIA will introduce MICM to the European AI research community by connecting it to networks such as CLAIRE, ELLIS, ADRA, AI NoEs, etc.
10.3 National initiatives
ANR COMSEMA
Website: ANR COMSEMA
- Title: Communications Sémantiques pour les futurs réseaux - Semantic Communications for future networks
- Duration: From November 1, 2024 to October 30, 2028
- Partners:
  - Institut National De Recherche En Informatique Et Automatique (INRIA), France
  - Centrale-Supelec
  - Orange
- Inria contact: François Brémond
- Coordinator: Mohamad Assaad
- Summary:
The ANR COMSEMA project, part of Thematiques Specifiques en Intelligence Artificielle (TSIA), runs from November 1, 2024 to October 30, 2028 (48 months) and aims to improve future networks incorporating video interpretation applications. Wireless networks are currently witnessing a radical shift from a purely data-oriented architecture to service-based and intelligence-based architectures, thus allowing the support of a diverse set of verticals. Thanks to the development of AI, future networks are expected to incorporate an even larger set of applications and services such as ReID applications and human activity recognition, interactive holograms, e-health, intelligent humanoid robots, etc. In this project, we consider video interpretation applications and propose a fundamentally semantic approach to redesign the entire process of information generation and transmission over the network. In particular, novel AI-based interference management that focuses on task achievement, rather than on bit rate improvement over the air interface, will be investigated. Inria is in charge of customizing video interpretation applications to improve data transmission over the network. The Inria grant is about 196 keuros (24 person-months) out of 560 keuros.
Inria Exploratory Action XGAN
- Title: Interpretable Representation Learning for Video GANs
- Duration: From 2024 to 2028
- Partners:
  - Inria Center at Université Côte d'Azur, France
- Inria contact: Antitza Dantcheva
- Coordinator: Antitza Dantcheva
- Summary:
The Inria Exploratory Action (AEx) XGAN, running from 2024 to 2028, aims at piercing the black box of generative models for video generation. Despite remarkable progress in generative models, such networks currently operate as black boxes. XGAN proposes strategies to interpret the latent space by (a) designing interpretable architectures and (b) analyzing symmetric functions in the input and output of patch-based generation. The Inria grant is about 170 keuros.
10.4 Regional initiatives
In 2011, we initiated a strategic partnership (called CobTek) with Nice hospital (CHU Nice, Prof. F. Askenazy) to start ambitious research activities dedicated to healthcare monitoring and assistive technologies. These studies address the analysis of more complex spatiotemporal activities (e.g. complex interactions, long-term activities).
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organization
General chair, scientific chair:
Participant: François Brémond.
François Brémond was:
- General Chair at IPAS 2025 [130 people], the IEEE International Conference on Image Processing, Applications and Systems (website), Lyon, January 2025. Member of the organizing committee (see 50).
- General Chair at the South Caucasus Conference on Artificial Intelligence - SCCAI 2025, MICM/GTU, Tbilisi, Georgia, September 16-18, 2025.
Member of the organizing committees:
Participants: Antitza Dantcheva, Michal Balazia.
Antitza Dantcheva was co-organizer of CV4BIOM, the Workshop on Computer Vision for Biometrics, Identity & Behaviour, held in conjunction with the International Conference on Computer Vision (ICCV 2025) on October 20th, 2025.
She was also co-organizer of the 4th Vision-based Remote Physiological Signal Sensing (RePSS) workshop in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI 2025) on August 28th, 2025.
Michal Balazia was a member of the technical committee and a session chair at SCCAI 2025. He was also session chair, program chair, and member of the organizing technical committee at ACMMM MultiMediate 2025.
11.1.2 Scientific events: selection
Participants: François Brémond, Antitza Dantcheva, Michal Balazia, Monique Thonnat.
Reviewer:
- François Brémond was reviewer for major Computer Vision / Machine Learning conferences, including ICCV, ECCV, CVPR, NeurIPS, AAAI, ICLR, WACV.
- Monique Thonnat was a member of the program committees of IJCAI 2025 and ICPRAM 2026.
- Antitza Dantcheva was reviewer and evaluator for SMASH, a Slovenian career-development training program. She also served as reviewer for major Computer Vision / Machine Learning conferences such as ICCV, CVPR, NeurIPS, AAAI, ICLR, WACV.
- Michal Balazia was reviewer in 2025 for ACMMM MultiMediate, ACM Multimedia, ICPR, and WACV.
11.1.3 Journal
Michal Balazia served as reviewer for TBIOM and MDPI Sensors.
11.1.4 Invited talks
Participants: François Brémond, Monique Thonnat, Antitza Dantcheva, Michal Balazia.
François Brémond gave the following invited talks:
- invited talk (1h) at IPAS 2025, the IEEE International Conference on Image Processing, Applications and Systems (IPAS Website), Lyon, January 9-11, 2025.
- invited talk (1h) at the South Caucasus Conference on Artificial Intelligence - SCCAI 2025, MICM/GTU, Tbilisi, Georgia, September 16-18, 2025.
- invited talk (1h) on "Video Action Recognition" at the University of Bristol, on 21 October 2025.
- Keynote speaker at the ePictureThis workshop on "Video Action Recognition for Human Behavior Analysis", TU-Eindhoven, on 28 October 2025.
Monique Thonnat was invited as keynote speaker at the IEEE ICPRS conference in Viña del Mar, Chile, December 1-4, 2025.
Antitza Dantcheva gave the following invited talks:
- invited talk in the Storyzy premises in Paris, May 7, 2025.
- invited talk in the online US Seminar on "US Developments and Impact of AI on Biometric Vulnerabilities", June 26, 2025.
- invited talk at the Workshop on "Synthetic Realities and Biometric Security: Advances in Forensic Analysis and Threat Mitigation (SRBS)", associated with BMVC, November 27.
- invited talk at SophIA Summit in Sophia Antipolis, November 20, 2025.
- invited talk at the University of Technology in Vienna (TU Wien), Austria, December 4, 2025.
Michal Balazia gave invited talks at Metascience and Guardians.
11.1.5 Contributed talks
Monique Thonnat spoke at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) in Tucson, Arizona, February 28 - March 4, 2025.
11.1.6 Scientific expertise
Participants: Monique Thonnat, Michal Balazia.
Monique Thonnat evaluated ANR projects as part of the evaluation committee (comité d’évaluation) “CE38 – Interfaces : mathématiques, sciences du numérique – sciences humaines et sociales”.
Michal Balazia served as a reviewer for ANR and NSERC.
11.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
Participant: François Brémond.
François Brémond taught AI courses on Computer Vision Deep Learning in the Data Science and AI MSc program at Université Côte d'Azur. Academic year 2025: 24 hours.
11.2.1 Supervision
Participants: François Brémond, Antitza Dantcheva, Michal Balazia.
François Brémond (co-)supervised 11 PhD students and many master's students:
- Tomasz Stanczyk: 3IA PhD student
- Valeriya Strizhkova: 3IA PhD student, defended on March 14, 2025 [52].
- Seongro Yoon: 3IA PhD student
- Tanay Agrawal: PhD student - Fellowship from European project Gain, defended on September 26, 2025 [51].
- Abid Ali: PhD student - Fellowship from BoostUrCAreer CoFund
- Snehashis Majhi: PhD student - Fellowship from Toyota
- Aglind Reka: Fellowship EUR Spectrum, Geoazur - Intelligent Mapping
- Ezem Ekmekci: 3IA PhD student
- Wenxin Xiong: Gredeg PhD student
- Yuan Gao: INRAE PhD student
- Sébastien Frey: Nice Hospital PhD student.
François Brémond also took part in supervising several internship students (master's and PhD) hosted by the STARS team.
Antitza Dantcheva (co-)supervised 5 PhD students and many master's students:
- Valeriya Strizhkova: 3IA PhD student
- Tanay Agrawal: PhD student - Fellowship from European project Gain
- Snehashis Majhi: PhD student - Fellowship from Toyota
- Nabyl Quignong: Inria AEX XGAN PhD
- Charbel Yahchouchi: CIFRE PhD Probayes
- Baptiste Chopin: Inria AEX XGAN Postdoc
- Anil Egin: Master's student
Michal Balazia supervised the following researchers:
- M2 interns: Aaryan Dhawan, Miriana Russo, Sanya Sinha
- engineer: Aowen Shi
- pre-docs: Quentin Merilleau, Aglind Reka.
11.2.2 Juries
Participants: François Brémond, Antitza Dantcheva.
François Brémond participated in the following juries:
- HDR:
  - Carlos Crispim from Université Lumière - Lyon 2, September 22, 2025
- PhD Thesis Review:
  - Nima Mehdi from Inria Centre at Université de Lorraine, December 17, 2024
  - Kevin Flanagan from the University of Bristol, October 22, 2025
  - Samy Tafasca from École Polytechnique Fédérale de Lausanne - EPFL, December 5, 2025
  - Salvatore Fiorilla from Università di Bologna, December 11, 2025
- CSI - Comité de suivi de thèse:
  - Marc Chapus, May 5, 2025
  - Keqi Chen, May 21, 2025
  - Monica Fossati, May 26, 2025
  - Federica Facente, May 31, 2025
  - Aela Le Sommer, June 3, 2025
  - Franz Fabini Franco Gallo, June 10, 2025
  - Thomas Campagnolo, July 1, 2025
  - Kaushik Bhowmik, July 1, 2025
  - Sofia Alexopoulou, July 11, 2025
  - Yannick Porto, July 9, 2025
  - Idir Chatar, September 12, 2025
Antitza Dantcheva participated in the following juries:
- PhD Thesis Review:
  - Sahar Husseini, Eurecom, June 17, 2025
- CSI - Comité de suivi de thèse:
  - Mehdi Atamna, December 8, 2025
  - Huyen Trang Nguyen, October 21, 2025
  - Yuanzhi Zhu, October 30, 2025
  - Huyen Trang Nguyen, July 7, 2025
11.3 Popularization
11.3.1 Specific official responsibilities in science outreach structures
Participants: François Brémond, Antitza Dantcheva, Michal Balazia.
- François Brémond participated in the organization of the SophIA Summit 2025.
- Michal Balazia gave invited talks at Metascience on June 19, 2025.
- Michal Balazia gave invited talks at Guardians on October 9, 2025.
11.3.2 Productions (articles, videos, podcasts, serious games, ...)
Participant: Michal Balazia.
Michal Balazia made a demo visualization tool for action detection in videos of psychiatric interviews.
11.3.3 Participation in Live events
Participant: François Brémond.
François Brémond took part in the following events:
- Presentation on "Human Action Recognition", part of “Fête de la science” at the Village des sciences d'Antibes – Juan-les-Pins, on October 11, 2025;
- Presentation for bachelor students, ENS de Lyon, Sophia Antipolis, November 2025;
- Presentation for high school students, part of Terra Numerica, Sophia Antipolis, December 2025.
11.3.4 Other science outreach relevant activities
Participant: François Brémond.
François Brémond gave an interview on "Automated video surveillance" to bachelor students from Sciences Po in February 2025.
12 Scientific production
12.1 Major publications
- 1 inproceedings: From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation. AVSS 2021 - 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, Virtual, United States, November 2021.
- 2 inproceedings: Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding. VISAPP '22: International Conference on Computer Vision Theory and Applications, Virtual, United States, IEEE; SCITEPRESS - Science and Technology Publications, February 2022, 501-508.
- 3 inproceedings: Multimodal Vision Transformers with Forced Attention for Behavior Analysis. WACV '23: IEEE International Winter Conference on Applications in Computer Vision, Waikoloa, United States, January 2023.
- 4 inproceedings: Video-based Behavior Understanding of Children for Objective Diagnosis of Autism. VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications, Online, France, February 2022.
- 5 inproceedings: Quo Vadis, Video Understanding with Vision-Language Foundation Models? NeurIPS Proceedings, NeurIPS 2024 - 38th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 2024.
- 6 inproceedings: Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation. MM'22: The 30th ACM International Conference on Multimedia, Lisbon, Portugal, ACM, October 2022, 70-79.
- 7 inproceedings: StressID: a Multimodal Dataset for Stress Identification. NeurIPS 2023 - 37th Conference on Neural Information Processing Systems, New Orleans, United States, December 2023.
- 8 article: Editorial: Recognizing the state of emotion, cognition and action from physiological and behavioral signals. Frontiers in Computer Science 4, August 2022.
- 9 inproceedings: Joint Generative and Contrastive Learning for Unsupervised Person Re-identification. CVPR 2021 - IEEE Conference on Computer Vision and Pattern Recognition, Virtual, United States, June 2021.
- 10 article: Learning Invariance from Generated Variance for Unsupervised Person Re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2022, 1-15. In press.
- 11 article: Semantic Event Fusion of Different Visual Modality Concepts for Activity Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2016, 1598-1611.
- 12 inproceedings: CTRN: Class Temporal Relational Network For Action Detection. BMVC 2021 - The British Machine Vision Conference, Virtual, United Kingdom, November 2021.
- 13 inproceedings: Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Montreal, Canada, October 2021.
- 14 inproceedings: MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. CVPR - Conference on Computer Vision and Pattern Recognition, New Orleans, United States, June 2022.
- 15 inproceedings: PDAN: Pyramid Dilated Attention Network for Action Detection. WACV 2021 - Winter Conference on Applications of Computer Vision 2021, Waikoloa / Virtual, United States, January 2021.
- 16 inproceedings: AAN: Attributes-Aware Network for Temporal Action Detection. BMVC 2023 - The 34th British Machine Vision Conference, Aberdeen, United Kingdom, November 2023.
- 17 article: Gender estimation based on smile-dynamics. IEEE Transactions on Information Forensics and Security, 2016, 11.
- 18 inproceedings: Toyota Smarthome: Real-World Activities of Daily Living. ICCV 2019 - 17th International Conference on Computer Vision, Seoul, South Korea, October 2019.
- 19 article: VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living. IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2021.
- 20 inproceedings: VPN: Learning Video-Pose Embedding for Activities of Daily Living. ECCV 2020 - 16th European Conference on Computer Vision, Glasgow (Virtual), United Kingdom, August 2020.
- 21 inproceedings: One-class autoencoder approach for optimal electrode set-up identification in wearable EEG event monitoring. EMBC 2021 - 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Virtual, France, October 2021.
- 22 inproceedings: A Self-Supervised Pre-Training Framework for Vision-Based Seizure Classification. 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing proceedings, IEEE ICASSP 2022: IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, Singapore, May 2022.
- 23 article: Synthetic Data in Human Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 1-20.
- 24 article: Measuring neuropsychiatric symptoms in patients with early cognitive decline using speech analysis. European Psychiatry 64(1), 2021, e64.
- 25 article: Multimodal phenotyping of psychiatric disorders from social interaction: Protocol of a clinical multicenter prospective study. Personalized Medicine in Psychiatry 33-34, May 2022, 100094.
- 26 inproceedings: FLAME: Facial Landmark Heatmap Activated Multimodal Gaze Estimation. AVSS 2021 - 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, Virtual, United States, November 2021.
- 27 inproceedings: Emotion Editing in Head Reenactment Videos using Latent Space Manipulation. 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), FG 2021 - IEEE International Conference on Automatic Face and Gesture Recognition, Jodhpur, India, December 2021.
- 28 inproceedings: G3AN: Disentangling Appearance and Motion for Video Generation. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Seattle / Virtual, United States, June 2020.
- 29 inproceedings: Self-Supervised Video Representation Learning via Latent Time Navigation. Technical Tracks 3, AAAI 2023 - AAAI Conference on Artificial Intelligence, Proceedings of the 37th AAAI Conference on Artificial Intelligence 37(3), Washington, D.C., United States, June 2023.
- 30 article: Feasibility Study of an Internet-Based Platform for Tele-Neuropsychological Assessment of Elderly in Remote Areas. Diagnostics 12, April 2022.
12.2 Publications of the year
International journals
International peer-reviewed conferences
Conferences without proceedings
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints