<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <title>Project-Team:THOTH</title>
    <link rel="stylesheet" href="../static/css/raweb.css" type="text/css"/>
    <meta name="description" content="Research Program - Datasets and evaluation"/>
    <meta name="dc.title" content="Research Program - Datasets and evaluation"/>
    <meta name="dc.subject" content=""/>
    <meta name="dc.publisher" content="INRIA"/>
    <meta name="dc.date" content="(SCHEME=ISO8601) 2016-01"/>
    <meta name="dc.type" content="Report"/>
    <meta name="dc.language" content="(SCHEME=ISO639-1) en"/>
    <meta name="projet" content="THOTH"/>
    <script type="text/javascript" src="https://raweb.inria.fr/rapportsactivite/RA2016/static/MathJax/MathJax.js?config=TeX-MML-AM_CHTML">
      <!--MathJax-->
    </script>
  </head>
  <body>
    <div class="tdmdiv">
      <div class="logo">
        <a href="http://www.inria.fr">
          <img style="align:bottom; border:none" src="../static/img/icons/logo_INRIA-coul.jpg" alt="Inria"/>
        </a>
      </div>
      <div class="TdmEntry">
        <div class="tdmentete">
          <a href="uid0.html">Project-Team Thoth</a>
        </div>
        <span>
          <a href="uid1.html">Members</a>
        </span>
      </div>
      <div class="TdmEntry">
        <a href="./uid3.html">Overall Objectives</a>
      </div>
      <div class="TdmEntry">Research Program<ul><li><a href="uid8.html">Designing and learning structured models</a></li><li><a href="uid12.html">Learning of visual models from minimal supervision</a></li><li><a href="uid17.html">Large-scale learning and optimization</a></li><li class="tdmActPage"><a href="uid21.html">Datasets and evaluation</a></li></ul></div>
      <div class="TdmEntry">Application Domains<ul><li><a href="uid26.html">Visual applications</a></li></ul></div>
      <div class="TdmEntry">
        <a href="./uid33.html">Highlights of the Year</a>
      </div>
      <div class="TdmEntry">New Software and Platforms<ul><li><a href="uid41.html">CoNFab: COnvolutional Neural FABric</a></li><li><a href="uid42.html">Modl</a></li><li><a href="uid43.html">M-CNN: Weakly-Supervised Semantic Segmentation using Motion Cues</a></li><li><a href="uid44.html">DALY: Daily Action Localization in Youtube</a></li><li><a href="uid45.html">GUN-71</a></li><li><a href="uid46.html">Synthetic human 3D pose dataset</a></li></ul></div>
      <div class="TdmEntry">New Results<ul><li><a href="uid48.html">Visual recognition in images</a></li><li><a href="uid63.html">Visual recognition in videos</a></li><li><a href="uid74.html">Large-scale statistical learning</a></li></ul></div>
      <div class="TdmEntry">Bilateral Contracts and Grants with Industry<ul><li><a href="uid82.html">MSR-Inria joint lab: scientific image and video mining</a></li><li><a href="uid83.html">MSR-Inria joint lab: structured large-scale machine learning</a></li><li><a href="uid84.html">Amazon</a></li><li><a href="uid85.html">Google</a></li><li><a href="uid86.html">Facebook</a></li><li><a href="uid87.html">MBDA</a></li><li><a href="uid88.html">Xerox Research Center Europe</a></li></ul></div>
      <div class="TdmEntry">Partnerships and Cooperations<ul><li><a href="uid90.html">Regional Initiatives</a></li><li><a href="uid92.html">National Initiatives</a></li><li><a href="uid96.html">European Initiatives</a></li><li><a href="uid100.html">International Initiatives</a></li><li><a href="uid112.html">International Research Visitors</a></li></ul></div>
      <div class="TdmEntry">Dissemination<ul><li><a href="uid118.html">Promoting Scientific Activities</a></li><li><a href="uid176.html">Teaching - Supervision - Juries</a></li></ul></div>
      <div class="TdmEntry">
        <div>Bibliography</div>
      </div>
      <div class="TdmEntry">
        <ul>
          <li>
            <a id="tdmbibentyear" href="bibliography.html">Publications of the year</a>
          </li>
        </ul>
      </div>
    </div>
    <div id="main">
      <div class="mainentete">
        <div id="head_agauche">
          <small><a href="http://www.inria.fr">
	    
	    Inria
	  </a> | <a href="../index.html">
	    
	    Raweb 
	    2016</a> | <a href="http://www.inria.fr/en/teams/thoth">Presentation of the Project-Team THOTH</a> | <a href="http://thoth.inrialpes.fr/">THOTH Web Site
	  </a></small>
        </div>
        <div id="head_adroite">
          <table class="qrcode">
            <tr>
              <td>
                <a href="thoth.xml">
                  <img style="align:bottom; border:none" alt="XML" src="../static/img/icons/xml_motif.png"/>
                </a>
              </td>
              <td>
                <a href="thoth.pdf">
                  <img style="align:bottom; border:none" alt="PDF" src="IMG/qrcode-thoth-pdf.png"/>
                </a>
              </td>
              <td>
                <a href="../thoth/thoth.epub">
                  <img style="align:bottom; border:none" alt="e-pub" src="IMG/qrcode-thoth-epub.png"/>
                </a>
              </td>
            </tr>
            <tr>
              <td/>
              <td>PDF
</td>
              <td>e-Pub
</td>
            </tr>
          </table>
        </div>
      </div>
      <!--FIN du corps du module-->
      <br/>
      <div class="bottomNavigation">
        <div class="tail_aucentre">
          <a href="./uid17.html" accesskey="P"><img style="align:bottom; border:none" alt="previous" src="../static/img/icons/previous_motif.jpg"/> Previous | </a>
          <a href="./uid0.html" accesskey="U"><img style="align:bottom; border:none" alt="up" src="../static/img/icons/up_motif.jpg"/>  Home</a>
          <a href="./uid26.html" accesskey="N"> | Next <img style="align:bottom; border:none" alt="next" src="../static/img/icons/next_motif.jpg"/></a>
        </div>
        <br/>
      </div>
      <div id="textepage">
        <!--DEBUT2 du corps du module-->
        <h2>Section: 
      Research Program</h2>
        <h3 class="titre3">Datasets and evaluation</h3>
        <p>Standard benchmarks with associated evaluation measures are becoming
increasingly important in computer vision, as they enable an
objective comparison of state-of-the-art approaches.
Such datasets need to be relevant for real-world application scenarios;
challenging for state-of-the-art algorithms; and
large enough to produce statistically significant results.</p>
        <p>A decade ago, small datasets were used to evaluate relatively simple tasks, such
as interest point detection and matching. Since then, both the size
of the datasets and the complexity of the tasks have grown steadily. One
example is the PASCAL Visual Object Classes (VOC) challenge, with 20 classes
and approximately 10,000 images, which evaluates object classification and
detection. Another is the ImageNet challenge, which includes thousands
of classes and millions of images. In the context of video classification,
the TRECVID Multimedia Event Detection challenges, organized by NIST,
evaluate activity classification on a dataset of over 200,000 video clips,
representing more than 8,000 hours of video, or roughly 11 months of
continuous footage.</p>
        <p>Almost all of the existing image and video datasets are annotated by hand;
this is the case for all of the examples cited above. In some cases, they
present limited and unrealistic viewing conditions. For example, many images
of the ImageNet dataset depict upright objects with virtually no background
clutter, and they may not capture particularly relevant visual concepts:
most people would not know the majority of subcategories of snakes cataloged
in ImageNet. This holds true for video datasets as well, where in addition
a taxonomy of action and event categories is missing.</p>
        <p>Our effort on data collection and evaluation will focus on two directions. First,
we will design and assemble video datasets, in particular for action and
activity recognition. This includes defining relevant taxonomies of
actions and activities. Second, we will provide data and
define evaluation protocols for weakly supervised learning methods. This does not mean,
of course, that we will forsake human supervision altogether: some amount
of ground-truth labeling is necessary for experimental validation and
comparison to the state of the art. Particular attention will be paid
to the design of efficient annotation tools.</p>
        <p>Not only do we plan to collect datasets, but also to provide them to the
community, together with accompanying evaluation protocols and software, to
enable a comparison of competing approaches for action recognition and
large-scale weakly supervised learning. Furthermore, we plan to set up
evaluation servers together with leaderboards, to establish an unbiased
state of the art on held-out test data for which the ground-truth
annotations are not distributed. This is crucial to prevent parameters from
being tuned on the test data and to guarantee a fair evaluation.</p>
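        <p>As a rough illustration of the evaluation-server idea, the following Python
sketch scores submissions against annotations that never leave the server and
keeps a leaderboard of each team's best result. The class name
<code>EvaluationServer</code>, the clip identifiers, the action labels and the
choice of plain accuracy as the measure are all assumptions made for this
example; they do not describe an existing service.</p>

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationServer:
    """Toy evaluation server: ground truth stays private, only scores leave."""

    # Hypothetical held-out annotations: clip id -> action label.
    _ground_truth: dict
    # Best accuracy seen so far for each team.
    _scores: dict = field(default_factory=dict)

    def submit(self, team, predictions):
        """Score one submission and record the team's best accuracy."""
        missing = set(self._ground_truth) - set(predictions)
        if missing:
            raise ValueError(f"missing predictions for {sorted(missing)}")
        correct = sum(
            predictions[clip] == label
            for clip, label in self._ground_truth.items()
        )
        accuracy = correct / len(self._ground_truth)
        self._scores[team] = max(accuracy, self._scores.get(team, 0.0))
        return accuracy

    def leaderboard(self):
        """Return (team, best accuracy) pairs, best first."""
        return sorted(
            self._scores.items(), key=lambda item: item[1], reverse=True
        )


# Hypothetical usage: participants only ever see their score.
server = EvaluationServer({"clip1": "run", "clip2": "jump", "clip3": "sit"})
server.submit("team_a", {"clip1": "run", "clip2": "jump", "clip3": "run"})
print(server.leaderboard())
```

        <p>Because participants receive only an aggregate score, the test annotations
cannot be used directly for training; a real server would also need to limit
the number of submissions, which this sketch does not do.</p>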
        <ul>
          <li>
            <p class="notaparagraph"><a name="uid22"> </a><b>Action recognition.</b> We will develop datasets for recognizing
human actions and human-object interactions (including multiple
persons) with a significant number of actions. Almost all of today's
action recognition datasets evaluate classification of short video
clips into a number of predefined categories, in many cases a number
of different sports, which are relatively easy to identify by their
characteristic motion and context. However, in many real-world
applications the goal is to identify and localize actions in entire
videos, such as movies or surveillance videos of several hours. The
actions targeted here are “real-world” and will be defined by
compositions of atomic actions into higher-level activities. One
essential component is the definition of relevant taxonomies of
actions and activities. We think that such a definition needs to rely
on a decomposition of actions into poses, objects and scenes, as
determining all possible actions without such a
decomposition is not feasible. We plan to provide annotations for
spatio-temporal localization of humans as well as relevant objects and
scene parts for a large number of actions and videos.</p>
          </li>
          <li>
            <p class="notaparagraph"><a name="uid23"> </a><b>Weakly supervised learning.</b> We will collect weakly labeled
images and videos for training. The collection process will be
semi-automatic. We will use image or video search engines such as
Google Image Search, Flickr or YouTube to find visual data
corresponding to the labels. Initial datasets will be obtained by
manually correcting whole-image/video labels; that is, we will
evaluate how well the object model can be learned when the entire image
or video is labeled but the object model has to be extracted
automatically. Subsequent datasets will feature noisy and incorrect
labels. Testing will be performed on PASCAL VOC'07 and ImageNet, but
also on more realistic datasets similar to those used for training,
which we will develop and manually annotate for evaluation. Our dataset
will include both images and videos; the categories represented will
include objects, scenes, and human activities; and the data will
be presented in realistic conditions.</p>
          </li>
          <li>
            <p class="notaparagraph"><a name="uid24"> </a><b>Joint learning from visual information and text.</b> Initially, we
will use a selection from the large number of movies and TV series for
which scripts are available online; see, for
example, <a href="http://www.dailyscript.com">http://www.dailyscript.com</a> and
<a href="http://www.weeklyscript.com">http://www.weeklyscript.com</a>. These scripts can easily be aligned with
the videos by establishing correspondences between script words and
(timestamped) spoken ones obtained from the subtitles or audio track.
The goal is to jointly learn from visual content and text. To measure
the quality of such a joint learning, we will manually annotate some
of the videos. Annotations will include the space-time locations of
the actions as well as the correct parse of the corresponding sentences. While DVDs
will, initially, receive most attention, we will also investigate the
use of data obtained from web pages, for example images with captions,
or images and videos surrounded by text. This data is by nature
noisier than scripts.</p>
          </li>
        </ul>
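        <p>The script-to-video alignment mentioned in the last item can be sketched as
follows. This is a minimal illustration, assuming that subtitles are already
available as (start, end, text) tuples with times in seconds, and that a
generic longest-matching-block sequence alignment from the Python standard
library (<code>difflib.SequenceMatcher</code>) is an acceptable stand-in for
whatever alignment method is ultimately used. The function names
<code>tokenize</code> and <code>align_script_to_subtitles</code> are
hypothetical.</p>

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def align_script_to_subtitles(script_text, subtitles):
    """Transfer subtitle timestamps to matching script words.

    subtitles: list of (start_seconds, end_seconds, text) tuples.
    Returns a list of (script_word_index, word, start_seconds, end_seconds).
    """
    script_words = tokenize(script_text)

    # Flatten the subtitles into one word sequence, remembering for each
    # word the time interval of the subtitle it came from.
    spoken_words, spoken_times = [], []
    for start, end, text in subtitles:
        for word in tokenize(text):
            spoken_words.append(word)
            spoken_times.append((start, end))

    # Find runs of words shared by the script and the spoken sequence.
    matcher = SequenceMatcher(a=script_words, b=spoken_words, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            start, end = spoken_times[block.b + offset]
            index = block.a + offset
            aligned.append((index, script_words[index], start, end))
    return aligned


# Hypothetical usage with a made-up script excerpt and subtitles.
script = "JOHN enters the kitchen. JOHN: Where did you put the keys?"
subs = [(12.0, 14.5, "Where did you put the keys?")]
for entry in align_script_to_subtitles(script, subs):
    print(entry)
```

        <p>Script words that are never spoken, such as scene descriptions and speaker
names, receive no timestamp here; in practice their timing would have to be
interpolated from the neighboring dialogue.</p>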
      </div>
      <!--FIN du corps du module-->
    </div>
  </body>
</html>
