<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <title>Project-Team:WILLOW</title>
    <link rel="stylesheet" href="../static/css/raweb.css" type="text/css"/>
    <meta name="description" content="Bilateral Contracts and Grants with Industry - Google: Learning to annotate videos from movie scripts (Inria)"/>
    <meta name="dc.title" content="Bilateral Contracts and Grants with Industry - Google: Learning to annotate videos from movie scripts (Inria)"/>
    <meta name="dc.creator" content="Josef Sivic"/>
    <meta name="dc.creator" content="Ivan Laptev"/>
    <meta name="dc.creator" content="Jean Ponce"/>
    <meta name="dc.subject" content=""/>
    <meta name="dc.publisher" content="INRIA"/>
    <meta name="dc.date" content="(SCHEME=ISO8601) 2017-01"/>
    <meta name="dc.type" content="Report"/>
    <meta name="dc.language" content="(SCHEME=ISO639-1) en"/>
    <meta name="projet" content="WILLOW"/>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
      <!--MathJax-->
    </script>
  </head>
  <body>
    <div class="tdmdiv">
      <div class="logo">
        <a href="http://www.inria.fr">
          <img style="align:bottom; border:none" src="../static/img/icons/logo_INRIA-coul.jpg" alt="Inria"/>
        </a>
      </div>
      <div class="TdmEntry">
        <div class="tdmentete">
          <a href="uid0.html">Project-Team Willow</a>
        </div>
        <span>
          <a href="uid1.html">Personnel</a>
        </span>
      </div>
      <div class="TdmEntry">Overall Objectives<ul><li><a href="./uid3.html">Statement</a></li></ul></div>
      <div class="TdmEntry">Research Program<ul><li><a href="uid5.html&#10;&#9;&#9;  ">3D object and scene modeling, analysis,
and retrieval</a></li><li><a href="uid7.html&#10;&#9;&#9;  ">Category-level object and scene recognition</a></li><li><a href="uid8.html&#10;&#9;&#9;  ">Image restoration, manipulation and enhancement</a></li><li><a href="uid9.html&#10;&#9;&#9;  ">Human activity capture and classification</a></li></ul></div>
      <div class="TdmEntry">Application Domains<ul><li><a href="uid13.html&#10;&#9;&#9;  ">Introduction</a></li><li><a href="uid14.html&#10;&#9;&#9;  ">Quantitative image analysis in science and humanities</a></li><li><a href="uid15.html&#10;&#9;&#9;  ">Video Annotation, Interpretation, and Retrieval</a></li></ul></div>
      <div class="TdmEntry">
        <a href="./uid17.html">Highlights of the Year</a>
      </div>
      <div class="TdmEntry">New Software and Platforms<ul><li><a href="uid24.html&#10;&#9;&#9;  ">LOUPE</a></li><li><a href="uid29.html&#10;&#9;&#9;  ">object-states-action</a></li><li><a href="uid34.html&#10;&#9;&#9;  ">SURREAL</a></li><li><a href="uid39.html&#10;&#9;&#9;  ">UNREL</a></li><li><a href="uid44.html&#10;&#9;&#9;  ">BIOGAN</a></li><li><a href="uid48.html&#10;&#9;&#9;  ">KernelImageRetrieval</a></li><li><a href="uid53.html&#10;&#9;&#9;  ">SCNet</a></li><li><a href="uid58.html&#10;&#9;&#9;  ">CNNGeometric</a></li><li><a href="uid63.html&#10;&#9;&#9;  ">LSDClustering</a></li></ul></div>
      <div class="TdmEntry">New Results<ul><li><a href="uid69.html&#10;&#9;&#9;  ">3D object and scene modeling, analysis, and retrieval</a></li><li><a href="uid79.html&#10;&#9;&#9;  ">Category-level object and scene recognition</a></li><li><a href="uid88.html&#10;&#9;&#9;  ">Image restoration, manipulation and enhancement</a></li><li><a href="uid91.html&#10;&#9;&#9;  ">Human activity capture and classification</a></li></ul></div>
      <div class="TdmEntry">Bilateral Contracts and Grants with Industry<ul><li><a href="uid103.html&#10;&#9;&#9;  ">Facebook AI Research Paris: Weakly-supervised interpretation of image and video data (Inria)</a></li><li class="tdmActPage"><a href="uid104.html&#10;&#9;&#9;  ">Google: Learning to annotate videos from movie scripts (Inria)</a></li><li><a href="uid105.html&#10;&#9;&#9;  ">Google: Structured learning from video and natural language (Inria)</a></li><li><a href="uid106.html&#10;&#9;&#9;  ">MSR-Inria joint lab: Image and video mining for science and humanities (Inria)</a></li></ul></div>
      <div class="TdmEntry">Partnerships and Cooperations<ul><li><a href="uid108.html&#10;&#9;&#9;  ">National Initiatives</a></li><li><a href="uid110.html&#10;&#9;&#9;  ">European Initiatives</a></li><li><a href="uid113.html&#10;&#9;&#9;  ">International Initiatives</a></li><li><a href="uid117.html&#10;&#9;&#9;  ">International Research Visitors</a></li></ul></div>
      <div class="TdmEntry">Dissemination<ul><li><a href="uid123.html&#10;&#9;&#9;  ">Promoting Scientific Activities</a></li><li><a href="uid204.html&#10;&#9;&#9;  ">Teaching - Supervision - Juries</a></li><li><a href="uid233.html&#10;&#9;&#9;  ">Popularization</a></li></ul></div>
      <div class="TdmEntry">
        <div>Bibliography</div>
      </div>
      <div class="TdmEntry">
        <ul>
          <li>
            <a id="tdmbibentyear" href="bibliography.html">Publications of the year</a>
          </li>
        </ul>
      </div>
    </div>
    <div id="main">
      <div class="mainentete">
        <div id="head_agauche">
          <small><a href="http://www.inria.fr">
	    
	    Inria
	  </a> | <a href="../index.html">
	    
	    Raweb 
	    2017</a> | <a href="http://www.inria.fr/en/teams/willow">Presentation of the Project-Team WILLOW</a> | <a href="http://www.di.ens.fr/willow">WILLOW Web Site
	  </a></small>
        </div>
        <div id="head_adroite">
          <table class="qrcode">
            <tr>
              <td>
                <a href="willow.xml">
                  <img style="align:bottom; border:none" alt="XML" src="../static/img/icons/xml_motif.png"/>
                </a>
              </td>
              <td>
                <a href="willow.pdf">
                  <img style="align:bottom; border:none" alt="PDF" src="IMG/qrcode-willow-pdf.png"/>
                </a>
              </td>
              <td>
                <a href="../willow/willow.epub">
                  <img style="align:bottom; border:none" alt="e-pub" src="IMG/qrcode-willow-epub.png"/>
                </a>
              </td>
            </tr>
            <tr>
              <td/>
              <td>PDF
</td>
              <td>e-Pub
</td>
            </tr>
          </table>
        </div>
      </div>
      <!--FIN du corps du module-->
      <br/>
      <div class="bottomNavigation">
        <div class="tail_aucentre">
          <a href="./uid103.html" accesskey="P"><img style="align:bottom; border:none" alt="previous" src="../static/img/icons/previous_motif.jpg"/> Previous | </a>
          <a href="./uid0.html" accesskey="U"><img style="align:bottom; border:none" alt="up" src="../static/img/icons/up_motif.jpg"/>  Home</a>
          <a href="./uid105.html" accesskey="N"> | Next <img style="align:bottom; border:none" alt="next" src="../static/img/icons/next_motif.jpg"/></a>
        </div>
        <br/>
      </div>
      <div id="textepage">
        <!--DEBUT2 du corps du module-->
        <h2>Section: 
      Bilateral Contracts and Grants with Industry</h2>
        <h3 class="titre3">Google: Learning to annotate videos from movie scripts (Inria)</h3>
        <p class="participants"><span class="part">Participants</span>:
	Josef Sivic, Ivan Laptev, Jean Ponce.</p>
        <p>The goal of this project is to automatically generate annotations of complex dynamic events in video. We wish to deal with events involving multiple people interacting with each other, with objects, and with the scene, for example people at a party in a house. The goal is to generate structured annotations that go beyond simple text tags. Examples include full sentences describing the video content, as well as bounding boxes or segmentations that localize the described objects and people in space and time. This is an extremely challenging task due to the large intra-class variation of human actions. We propose to learn joint video and text representations that enable such annotation capabilities from feature-length movies with coarsely aligned shooting scripts. Building on our previous work in this area, we aim to develop structured representations of video and associated text that make it possible to reason both spatially and temporally about scenes, objects and people, as well as their interactions. Automatic understanding and interpretation of video content is a key enabling factor for a range of practical applications such as content-aware advertising or search. Novel video and text representations are needed to enable a breakthrough in this area.</p>
      </div>
      <!--FIN du corps du module-->
    </div>
  </body>
</html>
