

Section: Application Domains

Narrative Description of Kitchen Activities from Egocentric Video

We have developed and evaluated a system that constructs situated, narrative descriptions of cooking activities, including food preparation, place setting, cleaning, and placing objects in storage areas. We are specifically interested in real-time, on-line techniques that recognize and interpret food types, food states, and the manipulation actions that transform food during preparation. We are exploring techniques for detecting, modelling, and recognizing a large vocabulary of actions and activities under different observational conditions, and for describing these activities in a larger context.

A full understanding of human actions requires recognizing what action has been performed, predicting how it will affect the surrounding environment, explaining why it has been performed, and identifying who is performing it. Classic approaches to action recognition interpret a spatio-temporal pattern in a video sequence to tell what action has been performed, and perhaps how and where it was performed. A more complete understanding requires information about why the action was performed and how it affects the environment. This facet of understanding can be provided by explaining the action as part of a narrative.

Most work on the recognition of cooking activities has concentrated on recognizing actions from the spatio-temporal patterns of hand motions. While some cooking activities may be recognized directly from motion, the resulting description is incomplete: it captures neither the state of the ingredients nor how they have been transformed by cooking actions. A fuller description must account for how food ingredients are transformed during the food preparation process.

We have addressed the automatic construction of cooking narratives by first detecting and locating ingredients and tools used in food preparation. We then recognize actions that involve transformations of ingredients, such as "slicing", and use these transformations to segment the video stream into visual events. We can then interpret detected events as a causal sequence of voluntary actions, confirmed by spatio-temporal transformation patterns, in order to automatically provide a narrative.
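The segmentation step described above can be illustrated with a minimal sketch. It assumes that a per-frame state label for a single ingredient is already available from a detector; the state names ("whole", "sliced", "diced") and the `Event` and `segment_by_state_change` names are hypothetical and chosen only for illustration, not taken from the system itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """A visual event: the ingredient changes state at a given frame."""
    frame: int        # index of the first frame showing the new state
    pre_state: str    # state observed just before the change
    post_state: str   # state observed from this frame onward


def segment_by_state_change(states: List[str]) -> List[Event]:
    """Split a per-frame state sequence into events at each state change."""
    events = []
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            events.append(Event(i, states[i - 1], states[i]))
    return events


# Hypothetical per-frame labels for one ingredient (assumed detector output).
frame_states = ["whole", "whole", "sliced", "sliced", "diced"]
events = segment_by_state_change(frame_states)
for e in events:
    print(f"frame {e.frame}: {e.pre_state} -> {e.post_state}")
```

A practical system would presumably smooth noisy per-frame labels before segmenting; this sketch omits that step.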

Our method is inspired by the intuition that, in a still frame, object states are visually more apparent than actions, and thus provide information that is complementary to spatio-temporal action recognition. We define a state transition matrix that maps each action label to a pre-state and a post-state. We identify key frames, and use these to learn appearance models of objects and their states. For recognition, we use a modified form of the VGG neural network, trained via transfer learning with a specially constructed data set of images of food types and food states. Manipulation actions are hypothesized from the state transition matrix and provide complementary evidence to spatio-temporal action recognition.
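As a rough illustration of how a state transition table can yield action hypotheses, the sketch below maps each action label to a (pre-state, post-state) pair and looks up the actions consistent with an observed state change. The table entries other than "slicing", the function names, and the multiplicative `boost` fusion rule are assumptions made for this example; the text does not specify how the two sources of evidence are combined.

```python
from typing import Dict, List, Tuple

# Hypothetical transition table: action label -> (pre-state, post-state).
TRANSITIONS: Dict[str, Tuple[str, str]] = {
    "slicing": ("whole", "sliced"),
    "dicing": ("sliced", "diced"),
    "peeling": ("unpeeled", "peeled"),
}


def hypothesize_actions(pre: str, post: str) -> List[str]:
    """Return the actions whose pre- and post-state match the observation."""
    return [a for a, (p, q) in TRANSITIONS.items() if p == pre and q == post]


def rerank(motion_scores: Dict[str, float],
           hypotheses: List[str],
           boost: float = 2.0) -> Dict[str, float]:
    """Up-weight motion-based scores for actions supported by the state
    change, then renormalize. The boost rule is an illustrative assumption."""
    weighted = {a: s * (boost if a in hypotheses else 1.0)
                for a, s in motion_scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return weighted
    return {a: w / total for a, w in weighted.items()}


# Assumed scores from a spatio-temporal action recognizer.
motion_scores = {"slicing": 0.4, "dicing": 0.6}
hypotheses = hypothesize_actions("whole", "sliced")
fused = rerank(motion_scores, hypotheses)
best_action = max(fused, key=fused.get)
print(hypotheses, best_action)
```

Here the motion-based scores alone favour "dicing", but the observed change from "whole" to "sliced" supports "slicing", which shows how state evidence can complement motion evidence.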