Section: Research Program
Human activity capture and classification
From a scientific point of view, visual action understanding is a computer vision problem that until recently has received little attention outside of extremely specific contexts such as surveillance or sports. Many of the current approaches to the visual interpretation of human activities are designed for a limited range of operating conditions, such as static cameras, fixed scenes, or restricted actions. The objective of this part of our project is to attack the much more challenging problem of understanding actions and interactions in unconstrained video depicting everyday human activities, such as those found in sitcoms, feature films, or news segments. The recent emergence of automated annotation tools for this type of video data (Everingham, Sivic, Zisserman, 2006; Laptev, Marszałek, Schmid, Rozenfeld, 2008; Duchenne, Laptev, Sivic, Bach, Ponce, 2009) means that massive amounts of labelled data for training and recognizing action models will at long last be available. Our research agenda in this scientific domain is described below, and our recent results are outlined in detail in Section 7.4.
Weakly-supervised learning and annotation of human actions in video
We aim to leverage the huge amount of available video data using readily available annotations in the form of video scripts. Scripts, however, often provide only imprecise and incomplete information about the video. We address this problem with weakly-supervised learning techniques at both the text and image levels. To this end, we have recently explored automatic mining of action categories and actor names from videos and corresponding scripts [6]. Within the PhD theses of Piotr Bojanowski and Jean-Baptiste Alayrac, we extend this direction by modeling the temporal order of actions and developing models for learning key steps from instruction videos [20].
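To make the idea of exploiting temporal order concrete, the following is a minimal sketch, not the method of the cited work. It assumes a script supplies an ordered list of K actions without timing, and that some classifier (hypothetical here) has already produced a score for each action at each frame. Dynamic programming then finds the frame labelling that respects the script order and maximizes the total score. The function name `align_ordered_actions` and the toy scores are illustrative assumptions.

```python
from typing import List


def align_ordered_actions(scores: List[List[float]]) -> List[int]:
    """Assign one action index to every frame under an ordering constraint.

    scores[t][k] is an assumed per-frame score of the k-th scripted action
    at frame t. The returned labels are non-decreasing over time, start at
    action 0, end at action K-1, and use every action at least once.
    """
    num_frames = len(scores)
    num_actions = len(scores[0])
    if num_frames < num_actions:
        raise ValueError("need at least one frame per action")

    neg_inf = float("-inf")
    # best[t][k]: best total score of a valid labelling of frames 0..t
    # that ends with action k at frame t.
    best = [[neg_inf] * num_actions for _ in range(num_frames)]
    back = [[0] * num_actions for _ in range(num_frames)]
    best[0][0] = scores[0][0]

    for t in range(1, num_frames):
        for k in range(num_actions):
            stay = best[t - 1][k]                            # same action continues
            advance = best[t - 1][k - 1] if k > 0 else neg_inf  # next action starts
            if advance > stay:
                best[t][k] = advance + scores[t][k]
                back[t][k] = k - 1
            else:
                best[t][k] = stay + scores[t][k]
                back[t][k] = k

    # Backtrack from the last action at the last frame.
    labels = [0] * num_frames
    k = num_actions - 1
    for t in range(num_frames - 1, -1, -1):
        labels[t] = k
        k = back[t][k]
    return labels


# Toy example: 5 frames, 3 scripted actions in a known order.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.3],
    [0.0, 0.1, 0.9],
]
toy_labels = align_ordered_actions(toy_scores)
```

In a weakly-supervised setting, an alignment of this kind could alternate with re-training the per-frame classifier on the inferred labels; that loop is omitted here.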
Descriptors for video representation
Video representation plays a crucial role in recognizing human actions and other components of a visual scene. Our work in this domain aims to develop generic methods for representing video data based on realistic assumptions. In particular, we develop deep learning methods and design new trainable representations for human action recognition. Such methods range from generic video-level representations based on space-time convolutional neural networks [27] to person-focused representations based on human pose [9], [23]. We also address the tasks of person detection [18], segmentation [4], [26] and tracking [7] in challenging video data.
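As an illustration of the basic operation behind space-time convolutional networks, the sketch below applies a single 3D filter over the time, height, and width axes of a grayscale video. It is a plain-Python toy under stated assumptions, not the architecture of the cited work: there is one channel, no padding, no stride, and the filter is a hand-written temporal difference rather than a learned one. The name `conv3d_valid` is hypothetical.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][column]


def conv3d_valid(video: Volume, kernel: Volume) -> Volume:
    """Slide a 3D filter over a video (cross-correlation, no padding)."""
    n_t, n_h, n_w = len(video), len(video[0]), len(video[0][0])
    k_t, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    output: Volume = []
    for t in range(n_t - k_t + 1):
        plane = []
        for y in range(n_h - k_h + 1):
            row = []
            for x in range(n_w - k_w + 1):
                acc = 0.0
                # Weighted sum over a small space-time neighbourhood.
                for dt in range(k_t):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        output.append(plane)
    return output


# A 2x1x1 temporal-difference filter: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]

# A static clip: the same 2x2 frame repeated three times.
static_clip = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(3)]

# A moving clip: a bright pixel shifts one column to the right.
moving_clip = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]

static_response = conv3d_valid(static_clip, temporal_diff)
moving_response = conv3d_valid(moving_clip, temporal_diff)
```

The static clip yields an all-zero response, while the moving clip yields non-zero values where the pixel left and arrived, which is the sense in which a filter extending over time can respond to motion rather than appearance alone.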