Modern sciences such as agronomy, bio-informatics, astronomy and environmental science must deal with overwhelming amounts of experimental data produced through empirical observation and simulation. Such data must be processed (cleaned, transformed, analyzed) in all kinds of ways in order to draw new conclusions, prove scientific theories and produce knowledge. However, constant progress in scientific observational instruments (e.g. satellites, sensors, the Large Hadron Collider) and simulation tools (which foster in silico experimentation, as opposed to traditional in situ or in vivo experimentation) creates a huge data overload.
a huge data overload. For example, climate modeling data are growing
so fast that they will lead to collections of hundreds of exabytes
(
Scientific data is also very complex, in particular because of heterogeneous methods used for producing data, the uncertainty of captured data, the inherently multi-scale nature (spatial scale, temporal scale) of many sciences and the growing use of imaging (e.g. satellite images), resulting in data with hundreds of attributes, dimensions or descriptors. Processing and analyzing such massive sets of complex scientific data is therefore a major challenge since solutions must combine new data management techniques with large-scale parallelism in cluster, grid or cloud environments.
Furthermore, modern science research is a highly collaborative process, involving scientists from different disciplines (e.g. biologists, soil scientists, and geologists working on an environmental project), in some cases from different organizations distributed over different countries. Each discipline or organization tends to produce and manage its own data, in specific formats, with its own processes. Thus, integrating distributed data and processes becomes increasingly difficult as the amounts of heterogeneous data grow.
Despite their variety, we can identify common features of scientific data: they are big; manipulated through complex, distributed workflows; typically complex, e.g. multidimensional or graph-based; uncertain in their values, e.g. to reflect data capture or observation; described by important metadata about experiments and their provenance; and mostly append-only (with rare updates).
Generic data management solutions (e.g. relational DBMSs), which have proved effective in many application domains (e.g. business transactions), are not efficient for dealing with scientific data, thereby forcing scientists to build ad-hoc solutions which are labor-intensive and cannot scale. In particular, relational DBMSs have lately been criticized for their “one size fits all” approach. Although they have been able to integrate support for all kinds of data (e.g. multimedia objects, XML documents) and new functions, this has resulted in a loss of performance and flexibility for applications with specific requirements, because they provide both “too much” and “too little”. Therefore, it has been argued that more specialized DBMS engines are needed. For instance, column-oriented DBMSs, which store column data together rather than rows as in traditional row-oriented relational DBMSs, have been shown to perform more than an order of magnitude better on decision-support workloads. The “one size does not fit all” counter-argument generally applies to cloud data management as well. Cloud data can be very large, unstructured (e.g. text-based) or semi-structured, and typically append-only (with rare updates). And cloud users and application developers may be in high numbers, but not DBMS experts. Therefore, current cloud data management solutions have traded consistency for scalability, simplicity and flexibility. As an alternative to relational DBMSs (which use the standard SQL language), these solutions have been termed Not Only SQL (NoSQL) by the database research community.
The three main challenges of scientific data management can be summarized as: (1) scale (big data, big applications); (2) complexity (uncertain, multi-scale data with lots of dimensions); (3) heterogeneity (in particular, heterogeneity of data semantics). The overall goal of Zenith is to address these challenges by proposing innovative solutions with significant advantages in terms of scalability, functionality, ease of use, and performance. To produce generic results, these solutions take the form of architectures, models and algorithms that can be implemented as components or services in specific computing environments, e.g. grid or cloud. To maximize impact, a good balance between conceptual aspects (e.g. algorithms) and practical aspects (e.g. software development) is necessary. We design and validate our solutions by working closely with scientific application partners (CIRAD, INRA, CEMAGREF, etc.). To further validate our solutions and extend the scope of our results, we also want to foster industrial collaborations, even in non-scientific applications, provided that they exhibit similar challenges.
Data management is concerned with the storage, organization, retrieval and manipulation of data of all kinds, from small and simple to very large and complex. It has become a major domain of computer science, with a large international research community and a strong industry. Continuous technology transfer from research to industry has led to the development of powerful DBMSs, now at the heart of any information system, and of advanced data management capabilities in many kinds of software products (application servers, document systems, search engines, directories, etc.).
The fundamental principle behind data management is data independence, which enables applications and users to deal with data at a high conceptual level while ignoring implementation details. The relational model, by resting on a strong theory (set theory and first-order logic) to provide data independence, has revolutionized data management. The major innovation of relational DBMSs has been to allow data manipulation through queries expressed in a high-level (declarative) language such as SQL. Queries can then be automatically translated into optimized query plans that take advantage of underlying access methods and indices. Many other advanced capabilities have been made possible by data independence: data and metadata modeling, schema management, consistency through integrity rules and triggers, transaction support, etc.
This data independence principle has also enabled DBMS to continuously integrate new advanced capabilities such as object and XML support and to adapt to all kinds of hardware/software platforms from very small smart devices (smart phone, PDA, smart card, etc.) to very large computers (multiprocessor, cluster, etc.) in distributed environments.
Following the invention of the relational model, research in data management has continued with the elaboration of strong database theory (query languages, schema normalization, complexity of data management algorithms, transaction theory, etc.) and the design and implementation of DBMS. For a long time, the focus was on providing advanced database capabilities with good performance, for both transaction processing and decision support applications. And the main objective was to support all these capabilities within a single DBMS.
The problems of scientific data management (massive scale, complexity and heterogeneity) go well beyond the traditional context of DBMS. To address them, we capitalize on scientific foundations in closely related domains: distributed data management, cloud data management, big data, uncertain data management, metadata integration, data mining and content-based information retrieval.
To deal with the massive scale of scientific data, we exploit large-scale distributed systems, with the objective of making distribution transparent to the users and applications. Thus, we capitalize on the principles of large-scale distributed systems such as clusters, peer-to-peer (P2P) and cloud, to address issues in data integration, scientific workflows, recommendation, query processing and data analysis.
Data management in distributed systems has traditionally been achieved by distributed database systems, which enable users to transparently access and update several databases in a network using a high-level query language (e.g. SQL). Transparency is achieved through a global schema which hides the local databases' heterogeneity. In its simplest form, a distributed database system is a centralized server that supports a global schema and implements distributed database techniques (query processing, transaction management, consistency management, etc.). This approach has proved effective for applications that can benefit from centralized control and full-fledged database capabilities, e.g. information systems. However, it cannot scale up to more than tens of databases. Data integration systems, e.g. price comparators such as KelKoo, extend the distributed database approach to access data sources on the Internet with a simpler query language in read-only mode.
Parallel database systems extend the distributed database approach to improve performance (transaction throughput or query response time) by exploiting database partitioning using a multiprocessor or cluster system. Although data integration systems and parallel database systems can scale up to hundreds of data sources or database partitions, they still rely on a centralized global schema and strong assumptions about the network.
Scientific workflow management systems (SWfMS) such as Kepler allow scientists to describe and execute their data processing pipelines as workflows, chaining the programs and services that consume and produce scientific data.
In contrast, peer-to-peer (P2P) systems adopt a completely decentralized approach to data sharing. By distributing data storage and processing across autonomous peers in the network, they can scale without the need for powerful servers. Popular examples of P2P systems such as Gnutella and BitTorrent have millions of users sharing petabytes of data over the Internet. Although very useful, these systems are quite simple (e.g. file sharing), support limited functions (e.g. keyword search) and use simple techniques (e.g. resource location by flooding) which have performance problems. To deal with the dynamic behavior of peers that can join and leave the system at any time, they rely on the fact that popular data get massively duplicated.
Initial research on P2P systems focused on improving the performance of query routing in unstructured systems, which rely on flooding, whereby peers forward messages to their neighbors. This work led to structured solutions based on Distributed Hash Tables (DHT), e.g. Chord and Pastry, and to hybrid solutions with super-peers that index subsets of peers. Another approach is to exploit gossiping protocols, also known as epidemic protocols. Gossiping was initially proposed to maintain the mutual consistency of replicated data by spreading replica updates to all nodes over the network. It has since been successfully used in P2P networks for data dissemination. Basic gossiping is simple: each peer has a complete view of the network (i.e., a list of all peers' addresses) and chooses a node at random to spread the request. The main advantage of gossiping is robustness to node failures since, with very high probability, the request is eventually propagated to all nodes in the network. In large P2P networks, however, the basic gossiping model does not scale, as maintaining the complete view of the network at each node would generate very heavy communication traffic. A solution for scalable gossiping is to provide each peer with only a partial view of the network, e.g. a list of tens of neighbor peers. To gossip a request, a peer chooses at random a peer in its partial view and sends it the request. In addition, the peers involved in a gossip exchange their partial views to reflect network changes in their own views. Thus, by continuously refreshing their partial views, nodes can self-organize into randomized overlays which scale up very well.
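The partial-view gossiping scheme described above can be sketched in a few lines. This is a toy simulation under simplifying assumptions (a static network, one random push per informed peer per round, a fixed random seed); the function and variable names are ours, not those of any particular system:

```python
import random

def gossip(n_peers, view_size, seed=0):
    """Simulate push gossip: each peer knows only a small random partial
    view of the network; in each round, every informed peer forwards the
    message to one randomly chosen neighbor from its partial view."""
    rng = random.Random(seed)
    # Each peer's partial view: a few random neighbor ids (not the full network).
    views = {p: rng.sample([q for q in range(n_peers) if q != p], view_size)
             for p in range(n_peers)}
    informed = {0}          # peer 0 initially holds the message
    rounds = 0
    while len(informed) < n_peers and rounds < 10 * n_peers:
        rounds += 1
        for p in list(informed):
            informed.add(rng.choice(views[p]))  # push to one random neighbor
    return len(informed), rounds

reached, rounds = gossip(n_peers=100, view_size=5)
```

Even with views of only 5 peers out of 100, the message reaches (almost) the whole network in a modest number of rounds, which is the scalability argument for partial views.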
We claim that a P2P solution is the right solution to support the collaborative nature of scientific applications as it provides scalability, dynamicity, autonomy and decentralized control. Peers can be the participants or organizations involved in collaboration and may share data and applications while keeping full control over their (local) data sources.
But for very large-scale scientific data analysis, or to execute very large data-intensive workflow activities (activities that manipulate huge amounts of data), we believe cloud computing (see next section) is the right approach, as it can provide virtually infinite computing, storage and networking resources. However, current cloud architectures are proprietary, ad-hoc, and may deprive users of the control of their own data. Thus, we postulate that a hybrid P2P/cloud architecture is more appropriate for scientific data management, combining the best of both approaches. In particular, it will enable the clean integration of the users' own computational resources with different clouds.
Cloud computing encompasses on demand, reliable services provided over the Internet (typically represented as a cloud) with easy access to virtually infinite computing, storage and networking resources. Through very simple Web interfaces and at small incremental cost, users can outsource complex tasks, such as data storage, system administration, or application deployment, to very large data centers operated by cloud providers. Thus, the complexity of managing the software/hardware infrastructure gets shifted from the users' organization to the cloud provider. From a technical point of view, the grand challenge is to support in a cost-effective way the very large scale of the infrastructure which has to manage lots of users and resources with high quality of service.
Cloud customers could move all or part of their information technology (IT) services to the cloud, with the following main benefits:
Cost. The cost for the customer can be greatly reduced since the IT infrastructure does not need to be owned and managed; billing is based only on resource consumption. For the cloud provider, using a consolidated infrastructure and sharing costs among multiple customers reduces the cost of ownership and operation.
Ease of access and use. The cloud hides the complexity of the IT infrastructure and makes location and distribution transparent. Thus, customers can have access to IT services anytime, and from anywhere with an Internet connection.
Quality of Service (QoS). The operation of the IT infrastructure by a specialized provider that has extensive experience in running very large infrastructures (including its own infrastructure) increases QoS.
Elasticity. The ability to scale resources out, up and down dynamically to accommodate changing conditions is a major advantage. In particular, it makes it easy for customers to deal with sudden increases in loads by simply creating more virtual machines.
However, cloud computing has some drawbacks and not all applications are good candidates for being “cloudified”. The major concerns are data security and privacy, and trust in the provider (which may itself rely on less trustworthy subproviders to operate). One earlier criticism of cloud computing was that customers get locked in proprietary clouds. It is true that most clouds are proprietary and there are no standards for cloud interoperability. But this is changing with open source cloud software such as Hadoop, an Apache project implementing Google's major cloud services such as the Google File System and MapReduce, and Eucalyptus, an open source cloud software infrastructure, both of which are attracting much interest from research and industry.
There is much more variety in cloud data than in scientific data since there are many different kinds of customers (individuals, SME, large corporations, etc.). However, we can identify common features. Cloud data can be very large, unstructured (e.g. text-based) or semi-structured, and typically append-only (with rare updates). And cloud users and application developers may be in high numbers, but not DBMS experts.
Big data has become a buzzword, with different meanings depending on your perspective, e.g. 100 terabytes is big for a transaction processing system, but small for a web search engine. It is also a moving target, as shown by two landmarks in DBMS products: the Teradata database machine in the 1980s and the Oracle Exadata database machine in 2010.
Although big data has been around for a long time, it is now more important than ever. We can see overwhelming amounts of data generated by all kinds of devices, networks and programs, e.g. sensors, mobile devices, the Internet, social networks, computer simulations, satellites, radiotelescopes, etc. Storage capacity has doubled every 3 years since 1980, with prices steadily going down (e.g. 1 Gigabyte cost $1M in 1982, $1K in 1995, and $0.12 in 2011), making it affordable to keep more data. And massive data can produce high-value information and knowledge, which is critical for data analysis, decision support, forecasting, business intelligence, research, (data-intensive) science, etc.
The problem of big data has three main dimensions, often referred to as the three V's:
Volume: refers to massive amounts of data, making it hard to store, manage, and analyze (big analytics);
Velocity: refers to continuous data streams being produced, making it hard to perform online processing and analysis;
Variety: refers to different data formats, different semantics, uncertain data, multiscale data, etc., making it hard to integrate and analyze.
There are also other V's, such as validity (is the data correct and accurate?), veracity (are the results meaningful?) and volatility (how long do you need to store this data?).
Current big data management (NoSQL) solutions have been designed for the cloud, as cloud and big data are synergistic. They typically trade consistency for scalability, simplicity and flexibility. They use a radically different architecture than RDBMSs, by exploiting (rather than embedding) a distributed file system such as the Google File System (GFS) or the Hadoop Distributed File System (HDFS) to store and manage data in a highly fault-tolerant manner. They tend to rely on a more specific data model (e.g. a key-value store such as Google Bigtable, Hadoop HBase or Apache CouchDB) with a simple set of operators that are easy to use from a programming language. For instance, to address the requirements of social network applications, new solutions rely on a graph data model and graph-based operators. User-defined functions also allow for more specific data processing. MapReduce is a good example of a generic parallel data processing framework on top of a distributed file system (GFS or HDFS). It supports a simple data model (sets of (key, value) pairs), which allows user-defined functions (map and reduce). Although quite successful among developers, it is relatively low-level and rigid, leading to custom user code that is hard to maintain and reuse. In Zenith, we exploit or extend MapReduce and NoSQL technologies to fit our needs for scientific workflow management and scalable data analysis.
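The MapReduce data model and its user-defined map and reduce functions can be illustrated with the classic word-count example. This is a single-process sketch of the programming model only (the function names are ours); a real deployment would run maps and reduces in parallel over a distributed file system such as GFS or HDFS:

```python
from collections import defaultdict
from itertools import chain

# User-defined functions over (key, value) pairs.
def map_fn(doc_id, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input pair.
    mapped = chain.from_iterable(map_fn(k, v) for k, v in inputs)
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce phase: apply reduce_fn to each group.
    return dict(chain.from_iterable(reduce_fn(k, vs) for k, vs in groups.items()))

counts = run_mapreduce([(1, "big data big analytics")], map_fn, reduce_fn)
# counts == {"big": 2, "data": 1, "analytics": 1}
```

The framework, not the user, handles the map/shuffle/reduce orchestration; this is exactly what makes the model simple yet, as noted above, low-level and rigid for complex analyses.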
Data uncertainty is present in many scientific applications. For instance, in the monitoring of plant contamination by INRA teams, sensors periodically generate data which may be uncertain. Instead of ignoring (or correcting) uncertainty, which may introduce major errors, we need to manage it rigorously and provide support for querying it.
To deal with uncertainty, there are several approaches, e.g. probabilistic, possibilistic, fuzzy logic, etc. The probabilistic approach is often used by scientists to model the behavior of their underlying environments. However, in many scientific applications, data management and uncertain query processing are not integrated, i.e., queries are usually answered using ad-hoc methods after manual or semi-automatic statistical treatment of the data retrieved from a database. In Zenith, we aim at integrating scientific data management and query processing within one system. This should allow scientists to issue their queries in a query language without thinking about the probabilistic treatment which should be done in the background in order to answer them. There are two important issues which any probabilistic DBMS (PDBMS) should address: 1) how to represent a probabilistic database, i.e., the data model; 2) how to answer queries using the chosen representation, i.e., query evaluation.
One of the problems on which we focus is scalable query processing over uncertain data. A naive solution for evaluating probabilistic queries is to enumerate all possible worlds, i.e., all possible instances of the database, execute the query in each world, and return the possible answers together with their cumulative probabilities. However, this solution cannot scale due to the exponential number of possible worlds which a probabilistic database may have. Thus, the problem is quite challenging, particularly because of the exponential number of possibilities that should be considered when evaluating queries. In addition, most of our underlying scientific applications are not centralized; the scientists share part of their data in a P2P manner. This distribution of data makes the processing of probabilistic queries very complicated. To develop efficient query processing techniques for distributed scientific applications, we can take advantage of two main distributed technologies: P2P and cloud. Our research experience in P2P systems has shown that we can propose scalable solutions for many data management problems. In addition, we can use cloud parallel solutions, e.g. MapReduce, to parallelize the task of query processing, when possible, and answer scientists' queries in reasonable execution times. Another challenge for supporting scientific applications is uncertain data integration. In addition to managing the uncertain data of each user, we need to integrate uncertain data from different sources. This requires revisiting traditional data integration in major ways and dealing with the problems of uncertain mediated schema generation and uncertain schema mapping.
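The possible-worlds semantics behind this naive evaluation can be made concrete on a tiny tuple-independent table, where each tuple exists independently with its own probability. The sketch below (with made-up sensor readings) shows both the semantics and why it cannot scale: it enumerates all 2^n worlds:

```python
from itertools import product

# A tuple-independent probabilistic table: (sensor id, value, existence probability).
table = [("sensor1", 21.5, 0.9), ("sensor2", 35.0, 0.6), ("sensor3", 34.0, 0.7)]

def query_prob(table, predicate):
    """Naive evaluation: enumerate all 2^n possible worlds and sum the
    probabilities of the worlds in which the query has a non-empty answer."""
    total = 0.0
    for world in product([True, False], repeat=len(table)):
        p = 1.0
        tuples = []
        for present, (sid, val, prob) in zip(world, table):
            p *= prob if present else (1.0 - prob)
            if present:
                tuples.append((sid, val))
        if any(predicate(t) for t in tuples):
            total += p
    return total

# Probability that some reading exceeds 30 degrees:
p = query_prob(table, lambda t: t[1] > 30)
# 1 - (1 - 0.6) * (1 - 0.7) = 0.88
```

With only 3 tuples this loops over 8 worlds; with a realistic table of millions of uncertain tuples, the enumeration is hopeless, which motivates the scalable evaluation techniques discussed above.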
Nowadays, scientists can rely on Web 2.0 tools to quickly share their data and/or knowledge (e.g. ontologies of the domain knowledge). Therefore, when performing a given study, a scientist typically needs to access and integrate data from many data sources (including public databases). To make high numbers of scientific data sources easily accessible to community members, it is necessary to identify semantic correspondences between the metadata structures or models of the related data sources. The main underlying task is called matching, which is the process of discovering semantic correspondences between metadata structures such as database schemas and ontologies. An ontology is a formal and explicit description of a shared conceptualization in terms of concepts (i.e., classes, properties and relations). For example, matching may be used to align gene ontologies or anatomical metadata structures.
To understand a data source's content, metadata (data that describe the data) are crucial. Metadata can be initially provided by the data publisher to describe the data structure (e.g. schema), the data semantics based on ontologies (which provide a formal representation of the domain knowledge) and other useful information about data provenance (publisher, tools, methods, etc.). Scientific metadata are very heterogeneous, in particular because of the great autonomy of the underlying data sources, which leads to a large variety of models and formats. This high heterogeneity makes the matching problem very challenging. Furthermore, the number of ontologies and their size are growing fast, and so are their diversity and heterogeneity. As a result, schema/ontology matching has become a prominent and challenging topic.
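As a minimal illustration of matching, one basic family of techniques compares attribute names with a string-similarity measure; real matchers combine many such heuristics with structural and semantic evidence (e.g. ontologies). The attribute names and the threshold below are made up for the example:

```python
from difflib import SequenceMatcher

def match_schemas(attrs_a, attrs_b, threshold=0.6):
    """Name-based matching: propose a correspondence between two attribute
    names when their (case-insensitive) string similarity exceeds a threshold."""
    matches = []
    for a in attrs_a:
        for b in attrs_b:
            sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if sim >= threshold:
                matches.append((a, b, round(sim, 2)))
    return matches

# Two toy schemas from autonomous biological data sources:
pairs = match_schemas(["gene_id", "organism", "seq_length"],
                      ["GeneID", "Organism_name", "length"])
```

Even this crude matcher pairs `gene_id` with `GeneID`, `organism` with `Organism_name` and `seq_length` with `length`; the hard part, as noted above, is doing this reliably at scale across heterogeneous models and formats.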
Data mining provides methods to discover new and useful patterns from very large sets of data. These patterns may take different forms, depending on the end-user's request, such as:
Frequent itemsets and association rules. In this case, the data is usually a table with a high number of rows and the algorithm extracts correlations between column values. This problem was first motivated by commercial and marketing purposes (e.g. discovering frequent correlations between items bought in a shop, which could help sell more). A typical frequent itemset from a sensor network in a smart building would say that “in 20% of rooms, the door is closed, the room is empty, and the lights are on”.
Frequent sequential pattern extraction. This problem is very similar to frequent itemset mining, but in this case, the order between events has to be considered. Let us consider the smart-building example again. A frequent sequence, in this case, could say that “in 40% of rooms, the lights are on at time i, the room is empty at time i+j and the door is closed at time i+j+k”. Discovering frequent sequences has become a crucial need in marketing, but also in security (e.g. detecting network intrusions), in usage analysis (web usage being one of the main applications) and in any domain where data arrive in a specific order (usually given by timestamps).
Clustering. The goal of clustering algorithms is to group together data that have similar characteristics, while ensuring that dissimilar data will not be in the same cluster. In our example of smart buildings, we would find clusters of rooms, where offices will be in one category and copy machine rooms in another because of their characteristics (hours of people's presence, number of times lights are turned on and off, etc.).
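The frequent-itemset pattern above can be illustrated with a brute-force miner over the smart-building example. This naive sketch (with our own toy data and names) enumerates candidate itemsets by increasing size, keeping those whose support reaches a threshold, and uses the Apriori-style observation that the search can stop as soon as no itemset of a given size is frequent:

```python
from itertools import combinations

def frequent_itemsets(rows, min_support):
    """Return all itemsets whose support (fraction of rows containing them)
    is at least min_support. Brute force, in the spirit of Apriori."""
    n = len(rows)
    frequent = {}
    items = sorted({i for row in rows for i in row})
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(1 for row in rows if set(cand) <= row) / n
            if support >= min_support:
                frequent[cand] = support
                found = True
        if not found:  # no frequent k-itemset => no frequent (k+1)-itemset
            break
    return frequent

# Rooms described by their sensor states:
rooms = [{"door_closed", "empty", "lights_on"},
         {"door_closed", "empty", "lights_on"},
         {"door_open", "occupied"},
         {"door_closed", "occupied", "lights_on"},
         {"door_closed", "empty", "lights_off"}]
freq = frequent_itemsets(rooms, min_support=0.4)
```

Here the itemset ("door_closed", "empty", "lights_on") is found with support 0.4, i.e., the kind of pattern quoted above ("in 40% of rooms, the door is closed, the room is empty, and the lights are on").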
One of the main problems for data mining methods has been to deal with data streams. Indeed, data mining methods were first designed for very large data sets on which complex artificial intelligence algorithms could not complete within reasonable response times because of the data size. The problem was thus to find a good trade-off between response time and result relevance. The patterns described above match this trade-off well, since they provide interesting knowledge for data analysts while allowing algorithms with good time complexity in the number of records. Itemset mining algorithms, for instance, depend more on the number of columns (for a sensor, the number of possible items such as temperature, presence, status of lights, etc.) than on the number of lines (the number of sensors in the network). However, with the ever-growing size of data and their production rate, a new kind of data source has recently emerged: data streams. A data stream is a sequence of events arriving at a high rate, where by “high rate” we mean a rate at which traditional data mining methods reach their limits and cannot complete in real time given the data size. In order to extract knowledge from such streams, a new trade-off had to be found, and the data mining community has investigated approximation methods that maintain a good quality of results for the above pattern extractions.
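As one example of such approximation methods, the Misra-Gries algorithm finds frequent items over a stream in a single pass with at most k-1 counters: any item occurring more than len(stream)/k times is guaranteed to survive, at the price of approximate counts. A minimal sketch on a toy stream (names and data are ours):

```python
def misra_gries(stream, k):
    """One-pass approximate frequent items with at most k-1 counters.
    Every item with frequency > len(stream)/k is guaranteed to be kept."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Counters full: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
heavy = misra_gries(stream, k=3)
# "a" occurs 5 times (> 9/3), so it survives; its count is a lower bound.
```

Memory usage is bounded by k regardless of the stream length, which is exactly the kind of trade-off (bounded resources, approximate results) mentioned above.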
For scientific data, data mining now has to deal with new and challenging characteristics. First, scientific data is often associated with a level of uncertainty (typically, sensed values have to be associated with the probability that the value is correct or not). Second, scientific data might be extremely large and require cloud computing solutions for their storage and analysis. Finally, we also have to deal with high-dimensional and heterogeneous data.
Today's technologies for searching information in scientific data mainly rely on relational DBMSs or text-based indexing methods. However, content-based information retrieval has progressed much in the last decade and is now considered one of the most promising approaches for future search engines. Rather than restricting search to the use of metadata, content-based methods attempt to index, search and browse digital objects by means of signatures describing their actual content. Such methods have been intensively studied in the multimedia community to allow searching the massive amounts of raw multimedia documents created every day (e.g. 99% of web data are audio-visual content with very sparse metadata). Successful and scalable content-based methods have been proposed for searching objects in large image collections or detecting copies in huge video archives. Besides multimedia content, content-based information retrieval methods have recently started to be studied on more diverse data such as medical images, 3D models or even molecular data. Potential applications in scientific data management are numerous: first of all, searching the huge collections of scientific images (earth observation, medical, botanical and biology images, etc.), but also browsing large datasets of experimental data (e.g. multisensor, molecular or instrumental data). Despite recent progress, scalability remains a major issue, involving complex algorithms (such as similarity search, clustering or supervised retrieval) in high-dimensional spaces (up to millions of dimensions) with complex metrics (Lp, kernels, set intersections, edit distances, etc.). Most of these algorithms have linear, quadratic or even cubic complexity, so that their use at large scale is not affordable without substantial breakthroughs. In Zenith, we plan to investigate the following challenges:
High-dimensional similarity search. Whereas many indexing methods were designed in the last 20 years to efficiently retrieve multidimensional data of relatively small dimensionality, high-dimensional data have been more challenging due to the well-known dimensionality curse. Only recently have methods appeared that allow approximate nearest neighbor queries in sub-linear time, in particular Locality Sensitive Hashing (LSH) methods, which offer new theoretical insights in high-dimensional Euclidean spaces and have proved the interest of random projections. But there are still challenging issues that need to be solved, including efficient similarity search in arbitrary kernel or metric spaces, efficient construction of k-nn graphs, and relational similarity queries.
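The random-projection idea behind LSH for cosine similarity can be sketched as follows: each random hyperplane contributes one bit of a signature (which side of the hyperplane the vector falls on), and vectors pointing in similar directions tend to agree on most bits, so bucketing on signatures yields sub-linear candidate retrieval. This is a toy illustration with made-up vectors and our own function names, not a production index:

```python
import random

def random_hyperplanes(dim, n_bits, seed=42):
    """Draw n_bits random Gaussian hyperplanes in dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: the sign of the dot product."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def hamming(s1, s2):
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

planes = random_hyperplanes(dim=4, n_bits=16)
close_a = [1.0, 0.9, 0.0, 0.1]
close_b = [0.9, 1.0, 0.1, 0.0]   # nearly the same direction as close_a
far_c   = [-1.0, 0.1, 0.9, -0.8] # a very different direction
sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (close_a, close_b, far_c))
```

The expected fraction of differing bits between two signatures is proportional to the angle between the vectors, so sig_a and sig_b should agree on far more bits than sig_a and sig_c.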
Large-scale supervised retrieval. Supervised retrieval aims at retrieving relevant objects in a dataset given some positive and/or negative training samples. To solve such tasks, there has been a focused interest in using Support Vector Machines (SVM), which offer the possibility to construct generalized, non-linear predictors in high-dimensional spaces using small training sets. The prediction time complexity of these methods is usually linear in the dataset size. Allowing hyperplane similarity queries in sub-linear time is, for example, a challenging research issue. A symmetric problem in supervised retrieval consists in retrieving the most relevant object categories that might contain a given query object, given huge labeled datasets (up to millions of classes and billions of objects) and very few objects per category (from 1 to 100 objects). SVM methods, which are formulated as quadratic programs with cubic training time complexity and quadratic space complexity, are clearly not usable. Promising solutions to such problems include hybrid supervised-unsupervised methods and supervised hashing methods.
Distributed content-based retrieval. Distributed content-based retrieval methods have recently appeared as a promising solution to manage masses of data distributed over large networks, particularly when the data cannot be centralized for privacy or cost reasons (which is often the case in scientific social networks, e.g. botanist social networks). However, current methods are limited to very simple similarity search paradigms. In Zenith, we will consider more advanced distributed content-based retrieval and mining methods such as k-nn graph construction, large-scale supervised retrieval and multi-source clustering.
The application domains covered by Zenith are very wide and diverse, as they concern data-intensive scientific applications, i.e., most scientific applications. Since the interaction with scientists is crucial to identify and tackle data management problems, we are dealing primarily with application domains for which Montpellier has an excellent track record, i.e., agronomy, environmental science, life science, with scientific partners like INRA, IRD, CIRAD and IRSTEA. However, we are also addressing other scientific domains (e.g. astronomy, oil extraction) through our international collaborations (e.g. in Brazil).
Let us briefly illustrate some representative examples of scientific applications on which we have been working.
Management of astronomical catalogs. An example of data-intensive scientific application is the management of the astronomical catalogs generated by the Dark Energy Survey (DES) project, on which we are collaborating with researchers from Brazil. In this project, huge tables with billions of tuples and hundreds of attributes (corresponding to dimensions, mainly double precision real numbers) store the collected sky data. Data are appended to the catalog database as new observations are performed, and the resulting database size is estimated to reach 100TB very soon. Scientists around the globe can query the database with queries that may involve a considerable number of attributes. The volume of data that this application holds poses important challenges for data management. In particular, efficient solutions are needed to partition and distribute the data over several servers. An efficient partitioning scheme should try to minimize the number of fragments accessed in the execution of a query, thus reducing the overhead of handling the distributed execution.
Personal health data analysis and privacy. The “Quantified Self” movement has gained great popularity over the past few years. Today, it is possible to acquire personal data in many domains. For instance, one can collect data on one's daily activities, habits or health, or measure performance in sports. This can be done thanks to sensors, communicating devices or even connected glasses (such as those currently being developed by companies like Google). Obviously, such data, once acquired, can lead to valuable knowledge in these domains. For people with a specific disease, it might be important to know whether they belong to a category that needs particular care. For an individual, it can be interesting to find the category that corresponds to her performance in a specific sport and then adapt her training with an adequate program. Meanwhile, for privacy reasons, people are reluctant to share their personal data and make them public. Therefore, it is important to provide them with solutions that can extract such knowledge from everybody's data, while guaranteeing that their private data will not be disclosed to anyone.
Botanical data sharing. Botanical data is highly decentralized and heterogeneous. Each actor has its own expertise domain, hosts its own data, and describes them in a specific format. Furthermore, botanical data is complex. A single plant observation might include many structured and unstructured tags, several images of different organs, some empirical measurements and other contextual data (time, location, author, etc.). A noticeable consequence is that simply identifying the plant species is often very difficult, even for botanists themselves (the so-called taxonomic gap). Botanical data sharing should thus speed up the integration of raw observation data, while providing users with easy and efficient access to the integrated data. This requires dealing with social-based data integration and sharing, massive data analysis and scalable content-based information retrieval. We address this application in the context of the French initiative Pl@ntNet, with CIRAD and IRD.
Deepwater oil exploitation. An important step in oil exploitation is pumping oil from ultra-deep water, thousands of meters below the surface, through long tubular structures called risers. Maintaining and repairing risers under deep water is difficult, costly and critical for the environment. Thus, scientists must predict riser fatigue based on complex scientific models and data observed on the risers. Riser fatigue analysis requires a complex workflow of data-intensive activities which may take a very long time to compute. A typical workflow takes as input files containing riser information, such as finite element meshes, winds, waves and sea currents, and produces result analysis files to be further studied by the scientists. It can have thousands of input and output files and tens of activities (e.g. dynamic analysis of riser movements, tension analysis, etc.). Some activities, e.g. dynamic analysis, are repeated for many different input files, and depending on the mesh refinements, each single execution may take hours to complete. Speeding up riser fatigue analysis requires parallelizing workflow execution, which is hard to do with existing systems. We address this application in collaboration with UFRJ and Petrobras.
These application examples illustrate the diversity of requirements and issues that we are addressing with our scientific application partners. To further validate our solutions and extend the scope of our results, we also want to foster industrial collaborations, even in non-scientific applications, provided that they exhibit similar challenges.
URL: https://
Apache Hadoop provides an open-source framework for reliable, scalable, parallel computing. It can be deployed and used on large-scale platforms such as Grid5000. However, its configuration and management are very difficult, especially given the dynamic nature of clusters. Therefore, we built Hadoop_g5k (Hadoop easy deployment in clusters), a tool that makes it easier to manage Hadoop clusters and prepare reproducible experiments. Hadoop_g5k offers a set of command-line scripts and a Python interface. It is used by Grid5000 users and helps them save considerable time in their MapReduce experiments.
URL: https://
LogMagnet is a software tool for analyzing streaming data, in particular log data. Log data usually arrive as lines recording the activities of humans or machines. For human activities, this may be the behavior on a Web site or the usage of an application. For machines, a log may record the activities of software and hardware components (say, for each node of a computing cluster, the calls to system functions or hardware alerts). Analyzing such data is often difficult, yet crucial. LogMagnet summarizes this data and provides a first analysis in the form of a clustering. The summary can be exploited as easily as the original data.
URL: https://
Recommender systems are used as a means to supply users with content that may be of interest to them. They have become a popular research topic, and many aspects and dimensions have been studied to make them more accurate and effective. In practice, recommender systems suffer from cold-start problems. However, users interact with many online services that can provide information about their interests and the content of items (e.g. the Google search engine, Facebook, Twitter, etc.). These services can be valuable data sources that help a recommender system model user and item preferences, and thus become more precise. Moreover, these data sources are distributed and geographically distant from each other, which raises many research problems and challenges for designing a distributed recommendation algorithm. MultiSite-Rec is a distributed collaborative filtering algorithm that exploits and combines these multiple, heterogeneous data sources to improve recommendation quality.
URL: http://
PlantRT is a distributed gossip-based content sharing platform that supports keyword search over plant observations and recommendation based on GPS position. It combines advantages of centralized and P2P systems.
URL: http://
Pl@ntNet is an image sharing and retrieval application for the identification of plants. It is developed in the context of the Pl@ntNet project, which involves four French research organisations (Inria, Cirad, INRA, IRD) and the members of the Tela Botanica social network. The key feature of the iOS and Android front ends is to help identify plant species from photographs, through a server-side visual search engine based on several results of the Zenith team on content-based information retrieval. Since its first release in March 2013 on the Apple App Store, the application has been downloaded by around 300K users in more than 150 countries (between 500 and 5000 active users daily, with peaks during week-ends). The collaborative training set that enables the content-based identification is continuously enriched by the users of the application and the members of the Tela Botanica social network. At the time of writing, it includes about 100K images covering more than 5000 French plant species, about four fifths of the whole French flora, making it the widest such identification tool built to date.
URL: http://
SON is an open source development platform for P2P networks using web services, JXTA and OSGi. SON combines three powerful paradigms: components, SOA and P2P. Components communicate by asynchronous message passing to provide weak coupling between system entities. To scale up and ease deployment, we rely on a decentralized organization based on a DHT for publishing and discovering services or data. In terms of communication, the infrastructure is based on JXTA virtual communication pipes, a technology that has been extensively used within the Grid community. Using SON, the development of a P2P application is done through the design and implementation of a set of components. Each component includes technical code that provides the component services and functional code that provides the component logic (in Java). The complex aspects of asynchronous distributed programming (the technical code) are thus separated from the functional code and automatically generated, for each component, from an abstract description of its provided and required services by the component generator.
Snoop is a generalist C++ library dedicated to high-dimensional data management and efficient similarity search. Its main features are dimension reduction, high-dimensional feature vector hashing, approximate k-nearest neighbors search and Hamming embedding. Snoop is a refactoring of a previous library called PMH, developed jointly with the French National Institute of Audiovisual. It is based on the joint research work of Alexis Joly and Olivier Buisson. SnoopIm is a content-based image search engine built on top of Snoop that allows retrieving small visual patterns or objects in large collections of pictures. The software is being experimented with in several contexts, including a logo retrieval application set up in collaboration with the French Press Agency, an experimental plant identification tool mixing textual and visual information retrieval (in the context of the Pl@ntNet project) and a research project on high-throughput analysis of root architecture images.
URL: http://
SciFloware is an action of technology development (ADT Inria) with the goal of developing a middleware for the execution of scientific workflows in a distributed and parallel way. It capitalizes on our experience with SON and an innovative algebraic approach to the management of scientific workflows. SciFloware provides a development environment and a runtime environment for scientific workflows, interoperable with existing systems. We validate SciFloware with workflows for analyzing biological data provided by our partners CIRAD, INRA and IRD.
URL: http://
In the context of an action of technology development (ADT) started in October 2010, WebSmatch is a flexible, open environment for discovering and matching complex schemas from many heterogeneous data sources over the Web. It provides three basic functions: (1) metadata extraction from data sources; (2) schema matching (both 2-way and n-way); (3) schema clustering to group similar schemas together. WebSmatch is delivered through Web services, to be used directly by data integrators or other tools, with RIA clients. It is implemented in Java, delivered as open source software (under LGPL) and protected by a deposit at APP (Agence de Protection des Programmes). WebSmatch is being used by Datapublica and CIRAD to integrate public data sources.
Patrick Valduriez received the 2014 Innovation Prize from Inria – Académie des sciences – Dassault Systèmes.
Miguel Liroz-Gistau received the best presentation award from the Grid5000 Spring School 2014 in Lyon for his talk on “Using Grid5000 for MapReduce Experiments”.
Triton, a new common laboratory (i-lab), has been created between Zenith and Beepeers (beepeers.com) to work on a platform for developing social networks in mobile/Web environments.
127 research groups worldwide registered for the LifeCLEF 2014 evaluation campaign chaired by Alexis Joly.
Data uncertainty in scientific applications can have many different causes: incomplete knowledge of the underlying system, inexact model parameters, inaccurate representation of initial boundary conditions, equipment inaccuracy, errors in data entry, etc.
An important problem that arises in big data integration is that of Entity Resolution (ER). ER is the process of identifying tuples that represent the same real-world entity. The problem of entity resolution over probabilistic data (which we call ERPD) arises in many distributed application domains that have to deal with probabilistic data, ranging from sensor databases to scientific data management.
The ERPD problem can be formally defined based on both the similarity and the probability of tuples.
Many real-life applications produce uncertain data distributed among a number of databases. Dealing with the ERPD problem over distributed data is quite important for such applications. A straightforward approach for answering distributed ERPD queries is to ask all distributed nodes to send their databases to a central node that deals with the ER problem using one of the existing centralized solutions. However, this approach is very expensive and does not scale well, either in the size of the databases or in the number of nodes.
In , we propose an efficient solution for the ERPD problem. Our contributions are summarized as follows. We adapted the possible worlds semantics of probabilistic data to define the ERPD problem based on both the similarity and the probability of tuples. We proposed a PTIME algorithm for the ERPD problem. This algorithm is applicable to a large class of similarity functions, in which the similarity score of two tuples depends only on their attributes, i.e., context-free functions. For the remaining (context-sensitive) similarity functions, we proposed a Monte Carlo approximation algorithm. We also proposed a parallel version of our Monte Carlo algorithm using the MapReduce framework. We conducted an extensive experimental study to evaluate our approach to ERPD over both real and synthetic datasets. The results show the effectiveness of our algorithms.
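The Monte Carlo idea can be sketched as follows, under strong simplifying assumptions (each tuple carries an independent existence probability, and "resolution" reduces to picking the best match above a similarity threshold; the actual algorithm in the paper is more general):

```python
import random

def monte_carlo_erpd(query, tuples, sim, threshold=0.8, n_worlds=1000, seed=0):
    """Monte Carlo sketch of ER over probabilistic data: `tuples` is a list
    of (value, existence_probability) pairs. We sample possible worlds and
    estimate, for each tuple, the probability that it is the best match of
    `query` among the tuples present in the sampled world."""
    rng = random.Random(seed)
    wins = [0] * len(tuples)
    for _ in range(n_worlds):
        # sample one possible world: each tuple exists with its probability
        world = [i for i, (_, p) in enumerate(tuples) if rng.random() < p]
        candidates = [i for i in world if sim(query, tuples[i][0]) >= threshold]
        if candidates:
            best = max(candidates, key=lambda i: sim(query, tuples[i][0]))
            wins[best] += 1
    return [w / n_worlds for w in wins]
```

The same sampling loop parallelizes naturally with MapReduce: each mapper samples a share of the worlds and the reducer aggregates the win counts.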
Another topic of interest is the integration of large astronomy data catalogs. The main challenge in such integration, besides the huge amount of catalog data to be merged, is the weak identification of sky objects, which leads to ambiguities in object matching amongst catalogs. In , we present the NACluster algorithm. NACluster considers a Euclidean metric space and distance function to drive disambiguation among objects in the various catalogs, and extends the traditional k-means algorithm to deal with the dynamic creation of new clusters representing real sky objects. NACluster shows F-measure results consistently superior to the matching results of the Q3C join operator, its closest competitor.
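The core extension to k-means can be illustrated with a minimal sketch (a hypothetical simplification, not the published algorithm): a catalog object joins the nearest cluster only if a centroid lies within a matching radius; otherwise it seeds a new cluster, i.e., a new candidate sky object.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dynamic_kmeans_pass(points, radius):
    """One assignment pass with dynamic cluster creation: each point joins
    the nearest centroid within `radius` (updating that centroid to the
    mean of its members), or starts a new cluster otherwise."""
    centroids, members = [], []
    for p in points:
        if centroids:
            j = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            if dist(p, centroids[j]) <= radius:
                members[j].append(p)
                n = len(members[j])  # recompute centroid incrementally
                centroids[j] = tuple(sum(x[d] for x in members[j]) / n
                                     for d in range(len(p)))
                continue
        centroids.append(p)   # no centroid close enough: new sky object
        members.append([p])
    return centroids, members
```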
The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. The CoherentPaaS European project addresses this problem, by providing a common programming language and holistic coherence across different cloud data stores.
In this context, we have started the design of a Cloud Multi-datastore Query Language (CloudMdsQL) and its query engine. CloudMdsQL is a functional SQL-like language capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query, which may contain embedded invocations to each data store’s native query interface. Thus, CloudMdsQL unifies a quite diverse set of data management technologies while preserving the expressivity of their local query languages. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatabase query language.
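What such a mediator does under the hood can be sketched roughly as follows. This is a hypothetical illustration, not CloudMdsQL's actual engine: each store runs its native subquery, and the mediator joins the partial results on a common key.

```python
def multistore_query(stores, subqueries, join_key, project):
    """Toy multistore mediator: `stores` maps a store name to a callable
    that runs a native query and returns rows as dicts; `subqueries` maps
    store names to native query strings. Partial results are hash-joined
    on `join_key`, then projected on the `project` columns."""
    results = {name: stores[name](q) for name, q in subqueries.items()}
    names = list(results)
    joined = results[names[0]]
    for name in names[1:]:
        index = {}
        for row in results[name]:           # build hash index on join key
            index.setdefault(row[join_key], []).append(row)
        joined = [dict(l, **r) for l in joined
                  for r in index.get(l[join_key], [])]
    return [{k: row[k] for k in project} for row in joined]
```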
Biologists have adopted ontologies for several reasons: (1) to provide canonical representation of scientific knowledge; (2) to annotate experimental data to enable interpretation, comparison, and discovery across databases; (3) to facilitate knowledge-based applications for decision support, natural language processing and data integration. The challenge is to automatically process complex databases and generate mappings using relevant ontologies in a way that scales to many resources and ontologies, while being easy to use for the biomedical community, customizable to fit specific needs, and smart enough to leverage the knowledge contained in ontologies.
The National Center for Biomedical Ontology (NCBO) has developed a popular ontology-based annotation workflow. To address the above challenge, we have integrated the NCBO annotator with our WebSmatch tool and the Biosemantic tool from IRD to perform semantic annotation using bio-ontologies. The resulting tool provides very useful capabilities. First, it can convert SQL database schemas to RDF/RDFS with Biosemantic. Second, it can annotate with the NCBO annotator and WebSmatch using the NCBO resources index. Third, the NCBO annotator relies on WebSmatch to create mappings between schema elements and ontological concepts, and uses ontology properties (i.e. subsumption, transitivity) to enhance the matching techniques.
Unlike the biomedical domain, which has accepted ontologies as a means to manage (integrate) knowledge, the agronomic sciences have yet to exploit their full potential. To this end, we are currently developing an RDF knowledge base, Agronomic Linked Data (AgroLD). The knowledge base is designed to integrate data from various publicly available plant-centric data sources. The aim of the AgroLD project is to collaborate with domain experts to bridge the gap between the technology and its potential users, and thus enhance biological research.
We consider peer-to-peer data management systems (PDMS), where each peer maintains mappings between its schema and some acquaintances, along with social links with peer friends. In this context, we deal with reformulating conjunctive queries from a peer’s schema into other peers’ schemas. Precisely, queries against a peer node are rewritten into queries against other nodes using schema mappings, thus obtaining query rewritings. Unfortunately, not all the obtained rewritings are relevant to a given query, as the information gain may be negligible or the peer may not be worth exploring. On the other hand, the existence of social links with peer friends can be useful for obtaining relevant rewritings.
In , we propose a new notion of “relevance” of a query with respect to a mapping that encompasses both local relevance (the relevance of the query wrt. the mapping) and global relevance (the relevance of the query wrt. the entire network). Based on this notion, we design a new query reformulation approach for social PDMS that achieves great accuracy and flexibility. We combine several techniques: (i) social links are expressed as FOAF (Friend of a Friend) links to characterize peer friendship; (ii) concise mapping summaries are used to obtain mapping descriptions; (iii) local semantic views are special views that contain information about mappings captured from the network using gossiping techniques. Our experimental evaluation, based on a prototype on top of PeerSim and a simulated network, demonstrates that our solution yields greater recall compared to traditional query translation approaches proposed in the literature.
Recommendation is becoming a popular mechanism to help users find relevant information in large-scale data (scientific data, web). Different diversification techniques have been proposed to avoid redundancy in the process of recommendation. Intuitively, the goal of recommendation diversification is to identify a list of items that are dissimilar, but nonetheless relevant to the user's interests.
The main goal of this work is to define a new diversified search and recommendation solution suited to scientific data (i.e., plant phenotyping, botanical data). We first propose an original profile diversification scoring function that addresses the problem of returning redundant items and enhances the quality of diversification compared to state-of-the-art solutions. We believe our work is the first to investigate profile diversity to address the problem of returning highly popular but too-focused items. Through experimental evaluation on two benchmarks, we showed that our scoring function presents the best compromise between diversity and relevancy. Next, to implement the new scoring function, we propose a top-k threshold-based algorithm that exploits a candidate list to achieve diversification. However, this algorithm is greedy and does not scale well. To overcome this limitation, we propose several techniques to improve performance. First, we simplify the scoring model to reduce its computational complexity. Second, we propose two techniques to reduce the number of items in the candidate list, and therefore the number of diversified scores to compute. Third, we propose different indexing scores (i.e., the scores used to sort the items in the inverted lists) that take the diversification of items into account and, using them, we developed an adaptive indexing approach that dynamically reduces the number of index accesses based on the query workload. We evaluated the performance of our techniques through experimentation; the results show that they reduce response time by up to a factor of 12 compared to a baseline greedy diversification algorithm.
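A baseline greedy diversification algorithm, of the kind used here as a point of comparison, can be sketched as follows (this is a generic relevance/diversity trade-off, not the paper's profile-diversity scoring function): at each step, pick the item that best balances relevance against dissimilarity to the items already selected.

```python
def diversified_topk(candidates, relevance, similarity, k, alpha=0.5):
    """Greedy diversified top-k: `relevance(item)` scores relevance,
    `similarity(a, b)` in [0, 1] scores redundancy, and `alpha` weights
    relevance against diversity."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            # diversity = distance to the closest already-selected item
            div = 1.0 if not selected else \
                1.0 - max(similarity(item, s) for s in selected)
            return alpha * relevance(item) + (1 - alpha) * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step scans all remaining candidates against all selected ones, which is exactly the quadratic cost that the candidate-list pruning and diversification-aware indexing techniques aim to avoid.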
We also address the problem of distributed and diversified recommendation (P2P and multi-site), which fits very well in different application scenarios. We propose a new scoring function (usefulness) to cluster relevant users over a distributed overlay. We analyzed the new clustering algorithm in detail and studied its behavior through an experimental evaluation on different datasets. Compared with state-of-the-art solutions, we obtain major gains in recall (on the order of 3 times).
With the increasing popularity of scientific workflows, public and private repositories are gaining importance as a means to share, find, and reuse such workflows. As the sizes of workflow repositories grow, methods to compare the scientific workflows stored in them become a necessity, for instance to allow duplicate detection or similarity search. Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and came to often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics.
We first contribute to the field by (i) comparing in isolation different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (ii) investigating how these can best be combined into aggregated measures, and (iii) making available a gold standard of over 2000 similarity ratings contributed by 15 workflow experts on a corpus of 1500 workflows, together with re-implementations of all the methods we evaluated.
Then, we present a novel and intuitive workflow similarity measure based on layer decomposition. Layer decomposition accounts for the directed dataflow underlying scientific workflows, a property that has not been adequately considered in previous methods. We comparatively evaluate our algorithm using our gold standard and show that it (a) delivers the best results for similarity search, (b) has a much lower runtime than other, often highly complex, competitors in structure-aware workflow comparison, and (c) can easily be stacked with even faster, structure-agnostic approaches to further reduce runtime while retaining result quality.
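The decomposition step can be sketched as a longest-path layering of the workflow DAG (a sketch of the underlying idea, not the full similarity measure): inputs sit in layer 0, and every module sits one layer below its deepest predecessor, so comparing two workflows reduces to comparing sequences of module sets.

```python
def layer_decomposition(workflow):
    """Decompose a workflow DAG (module -> list of successor modules)
    into layers via a topological sweep. Returns the list of layers,
    each a set of module names."""
    indegree = {m: 0 for m in workflow}
    for succs in workflow.values():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    layer = {m: 0 for m, d in indegree.items() if d == 0}  # the inputs
    frontier = list(layer)
    while frontier:
        m = frontier.pop()
        for s in workflow.get(m, []):
            layer[s] = max(layer.get(s, 0), layer[m] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                frontier.append(s)
    layers = {}
    for m, l in layer.items():
        layers.setdefault(l, set()).add(m)
    return [layers[l] for l in sorted(layers)]
```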
As the scale of the data increases, scientific workflow management systems (SWfMSs) need to support workflow execution in High Performance Computing (HPC) environments. Because of its various benefits, the cloud emerges as an appropriate infrastructure for workflow execution. However, it is difficult to execute some scientific workflows in a single cloud site because of the geographical distribution of scientists, data and computing resources. Therefore, a scientific workflow often needs to be partitioned and executed in a multisite environment.
In , we define a multisite cloud architecture that is composed of traditional clouds, e.g., a pay-per-use cloud service such as Amazon EC2, private data-centers, e.g. the cloud of a scientific organization like Inria, COPPE or LNCC, and client desktop machines that have authorized access to the data-centers. We can model this architecture as a distributed system on the Internet, each site having its own computer cluster, data and programs. An important requirement is to provide distribution transparency for advanced services (i.e., workflow management, data analysis), to ease their scalability and elasticity. Current solutions for multisite clouds typically rely on application-specific overlays that map the output of one task at a site to the input of another in a pipeline fashion. Instead, we define fully distributed services for data storage, intersite data movement and task scheduling.
Also, SWfMSs generally execute a scientific workflow in parallel within a single site. In , we propose a non-intrusive approach to execute scientific workflows in a multisite cloud, with three workflow partitioning techniques. We describe an experimental validation using an adaptation of the Chiron SWfMS for the Microsoft Azure multisite cloud. The experimental results show the efficiency of our partitioning techniques and their superiority in different environments.
Dynamic workflows are scientific workflows supporting computational science simulations, typically using dynamic processes based on runtime scientific data analyses. They require the ability to adapt the workflow at runtime, based on user input and dynamic steering. Supporting data-centric iteration is an important step towards dynamic workflows because user interaction with workflows is iterative. However, current support for iteration in scientific workflows is static and does not allow changing data at runtime.
In , we propose a solution based on algebraic operators and a dynamic execution model to enable workflow adaptation based on user input and dynamic steering. We introduce the concept of iteration lineage, which makes provenance data management consistent with dynamic iterative workflow changes. Lineage enables scientists to interact with workflow data and configuration at runtime, through an API that triggers steering. We evaluate our approach using a novel, real large-scale workflow for uncertainty quantification on a 640-core cluster. The results show impressive execution time savings, from 2.5 to 24 days, compared to non-iterative workflow execution. We verify that the maximum overhead introduced by our iterative model is less than 5% of execution time. Also, our proposed steering algorithms are very efficient, running in less than 1 millisecond in the worst-case scenario.
The amount of data captured or generated by modern computing devices has grown exponentially over the last years. To process this big data, parallel computing has been a major solution in both industry and research. This is why the MapReduce framework, which transparently provides automatic distribution, parallelization and fault-tolerance over low-cost machines, has become one of the standards in big data analysis.
For processing a big dataset over a cluster of nodes, one main step is data partitioning (or fragmentation), to divide the dataset among the nodes. In , we consider applications with very large databases, where data items are continuously appended, so efficient data partitioning is one of the main requirements for good performance. This problem is particularly hard for some scientific databases, such as astronomical catalogs: the complexity of the schema limits the applicability of traditional automatic approaches based on basic partitioning techniques, and the high dynamicity makes graph-based approaches impractical, as they require considering the whole dataset to come up with a good partitioning scheme. In this work, we propose DynPart and DynPartGroup, two dynamic partitioning algorithms for continuously growing databases. These algorithms efficiently adapt the data partitioning to the arrival of new data elements by taking into account the affinity of the new data with queries and fragments. In contrast to existing static approaches, our approach offers constant execution time, no matter the size of the database, while obtaining very good partitioning efficiency. We validate our solution through experimentation on real-world data; the results show its effectiveness.
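The affinity idea can be sketched as follows, under deliberately simple assumptions (this is a hypothetical simplification, not DynPart itself): each fragment remembers which queries access it, and a newly appended item goes to the non-full fragment whose query set overlaps most with the queries expected to access the item, so those queries keep touching few fragments.

```python
def affinity_place(item_queries, fragments, max_size):
    """Place a newly appended item by query affinity. `fragments` is a
    mutable list of (query_id_set, size) pairs; returns the index of the
    fragment chosen (appending a new fragment when no affinity exists)."""
    best, best_affinity = None, 0
    for idx, (qset, size) in enumerate(fragments):
        if size >= max_size:
            continue                       # fragment is full, skip it
        affinity = len(item_queries & qset)
        if affinity > best_affinity:
            best, best_affinity = idx, affinity
    if best is None:                       # no overlap: start a new fragment
        fragments.append((set(item_queries), 1))
        return len(fragments) - 1
    qset, size = fragments[best]
    fragments[best] = (qset | item_queries, size + 1)
    return best
```

Placement cost depends only on the number of fragments, not on the database size, which is the key property claimed for the dynamic approach.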
We address the problem of data skew in the MapReduce parallel processing framework. In many cases, because of skewed intermediate data, a high percentage of the processing in the reduce side of MapReduce is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address this problem of data skew, but only for specific cases. In particular, there is no solution when all or most of the intermediate values correspond to a single key, or to a set of keys that is smaller than the number of reduce workers.
In this work, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and can efficiently deal with the problem of reduce-side data skew. We extended the programming model of MapReduce to allow reduce workers to collaborate on processing the values of an intermediate key, without affecting the correctness of the final results. In FP-Hadoop, the reduce function is replaced by two functions: intermediate reduce and final reduce. There are three phases, each corresponding to one of the functions: the map, intermediate reduce and final reduce phases. In the intermediate reduce phase, the intermediate reduce function, which usually carries the main load of reducing in MapReduce jobs, is executed by reduce workers in a collaborative way, even if all values belong to only one intermediate key. This allows performing a big part of the reducing work with the computing resources of all workers, even in the case of highly skewed data. We implemented a prototype of FP-Hadoop by modifying Hadoop’s code and conducted extensive experiments over synthetic and real datasets. The results show that FP-Hadoop makes MapReduce job processing much faster and more parallel, and can efficiently deal with skewed data. We achieve excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.
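The two-function model can be sketched for a single skewed key as follows (a sequential simulation of what FP-Hadoop distributes across workers; it assumes the reduction is decomposable, e.g. sum, max or top-k):

```python
from itertools import islice

def fp_reduce(values, intermediate_reduce, final_reduce, n_workers=4):
    """Simulate FP-Hadoop's split reduce for one intermediate key: the
    values are cut into blocks (one per simulated worker), each block is
    collapsed by `intermediate_reduce`, and the partial results are then
    combined by `final_reduce`."""
    values = list(values)
    block = max(1, len(values) // n_workers)
    it = iter(values)
    partials = []
    while True:
        chunk = list(islice(it, block))
        if not chunk:
            break
        partials.append(intermediate_reduce(chunk))  # parallel in FP-Hadoop
    return final_reduce(partials)
```

Even if every value maps to the same key, the heavy `intermediate_reduce` work is spread over all blocks; only the small list of partials reaches the final reducer.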
In recent years, there has been a growing interest in probabilistic data management. In , we focus on probabilistic time series, whose main characteristic is the high volume of data, calling for efficient compression techniques. To date, most work on probabilistic data reduction has provided synopses that minimize the error of representation w.r.t. the original data. However, in most cases, the compressed data are meaningless for usual queries involving aggregation operators such as SUM or AVG. We propose PHA (Probabilistic Histogram Aggregation), a compression technique whose objective is to minimize the error of such queries over compressed probabilistic data. We incorporate the aggregation operator given by the end-user directly into the compression technique and obtain a much lower error in the long term. We also adopt a global error-aware strategy to manage large sets of probabilistic time series, where the available memory is carefully balanced between the series according to their individual variability.
Usage mining is a significant research area with applications in various fields. However, Web usage data is usually considered streaming, due to its high volume and rate. Because of these characteristics, we only have access, at any point in time, to a small fraction of the stream. When the data is observed through such a limited window, giving a reliable description of the recent usage data is challenging. In , we show that data intralinkings, i.e., the fact that a usage record (event) may be associated with other records (events) in the same dataset, are common in Web usage streams. Therefore, to give a more faithful account of Web usage behaviors, data stream models for Web usage streams should be able to process such intralinkings. We study the important consequences of these constraints and intralinkings through the "bounce rate" problem and the clustering of usage streams. We then propose the user-centric ABS (Anti-Bouncing Stream) model, which combines the advantages of previous models while avoiding their drawbacks. First, ABS is the first data stream model able to capture the intralinkings between Web usage records; it is also the first user-centric data stream model that can associate usage records with users in Web usage streams. Second, owing to its simple but effective management principle, the data in ABS is available at any time for analysis; under the same resource constraints as existing models in the literature, ABS better models the recent data. Third, ABS better measures bounce rates for Web usage streams. We demonstrate its superiority through a theoretical study and experiments on two real-world datasets.
In , we propose a novel framework of autonomic intrusion detection that performs online and adaptive intrusion detection over unlabeled HTTP traffic streams in computer networks. The framework holds potential for self-management: self-labeling, self-updating and self-adapting. It employs the Affinity Propagation (AP) algorithm to learn a subject's behaviors through dynamic clustering of the streaming data, automatically labeling the data and adapting to normal behavior changes while identifying anomalies. Two large real HTTP traffic streams collected in our institute, as well as the benchmark KDD'99 dataset, are used to validate the framework and the method. The test results show that the autonomic model achieves better effectiveness and efficiency than the adaptive Sequential Karhunen-Loève method and static AP, as well as three other static anomaly detection methods, namely k-NN, PCA and SVM.
In , we consider the problem of recognizing legal entities in visual content, in a way similar to named-entity recognition in text documents. Whereas previous work was restricted to recognizing a few tens of logotypes, we generalize the problem to recognizing thousands of legal persons, each modeled by a rich corporate identity automatically built from web images. We introduce a new geometrically-consistent instance-based classification method with several benefits over state-of-the-art instance classification methods: an efficient training phase reduced to a simple indexing process with linear time and space complexity, easy management of multi-labeled images, fine-grained localisation of the recognized patterns, and the possibility of dynamically inserting additional training images in an incremental way. Experiments on an automatic web crawl of 5,824 legal entities show that our method achieves better results than state-of-the-art techniques while being much more scalable.
Building accurate knowledge of the identity, geographic distribution and evolution of living species is essential for the sustainable development of humanity as well as for biodiversity conservation. In this context, crowdsourced data collection and multimedia identification tools are considered one of the most promising solutions. With recent advances in digital devices, network bandwidth and information storage capacity, the production of multimedia data has indeed become an easy task. The emergence of citizen science and social networking tools has fostered the creation of large and structured communities of nature observers (e.g. e-bird, xeno-canto, Tela Botanica), who have started to produce outstanding collections of multimedia records. Unfortunately, the performance of state-of-the-art multimedia analysis techniques on such data is still not well understood and is far from meeting real-world requirements for identification tools. We therefore created LifeCLEF , , , , a new lab of the CLEF international forum.
Besides organizing the campaign, we also participated in two tasks in order to evaluate the content-based retrieval technologies developed within ZENITH. We notably implemented a new method for the bird task based on dense indexing of MFCC features and offline pruning of the non-discriminant ones. To make such a strategy scale to the 30M MFCC features extracted from the tens of thousands of audio recordings in the training set, we used high-dimensional hashing techniques coupled with an efficient approximate nearest-neighbor search algorithm with controlled quality. Further improvements were obtained by (i) using a sliding classifier with max pooling, (ii) weighting the query features according to their semantic coherence, and (iii) using the metadata to filter out incoherent species. The results showed the effectiveness of the proposed technique, which ranked 3rd among the 10 participating groups (some of them with years of experience in bioacoustics).
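The hashing-based search step can be sketched as follows, using random-hyperplane LSH as a stand-in (the actual indexing scheme, pruning strategy and parameters of the submitted system are not reproduced here; the dimension and code length are assumed values): similar MFCC-like vectors tend to share a hash code, so a query only scans the candidates in its own bucket instead of all 30M features.

```python
# Hedged sketch of hashing-based approximate nearest-neighbor search over
# audio feature vectors (illustrative, not the system's actual index):
# random-hyperplane LSH assigns one sign bit per hyperplane, and a query
# scans only the bucket matching its code.
import random

DIM, BITS = 13, 8  # 13 MFCC coefficients, 8-bit codes (assumed values)
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def code(v):
    # One sign bit per random hyperplane.
    return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                 for plane in planes)

def build_index(vectors):
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(code(v), []).append(i)
    return index

def query(index, vectors, q):
    # Scan only the query's bucket (no multi-probe in this sketch).
    cands = index.get(code(q), [])
    return min(cands, default=None,
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], q)))
```

In practice such a scheme is combined with multi-probe or multiple hash tables to control recall, which is what "controlled quality" refers to above.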
We finally investigated new interactive identification methods in , extending classical faceted search mechanisms to so-called visual facets. The principle is to automatically build comprehensive visual illustrations of the expert data available in classical structured botanical datasets, by building a visual matching graph of the related pictures and choosing the most connected ones. Additional facets can then be built automatically by clustering the graph and solving incompleteness issues.
Pl@ntNet is an innovative participatory sensing platform relying on image-based plant identification as a means to enlist non-expert contributors and facilitate the production of botanical observation data . 18 months after the public launch of the iOS application (and 6 months after the release of the Android version ), we carried out a self-critical evaluation of the experience with regard to the requirements of a sustainable and effective ecological surveillance tool (to appear in the Multimedia Systems journal). Thanks to usage data analytics, we first demonstrated the attractiveness of the system (more than 300K end-users, with several thousand daily users) as well as the self-improving capacity of the whole collaborative workflow (1.5 million observations collected). We also pointed out the current limitations of the approach towards producing timely and accurate distribution maps of plants at very large scale, and discussed two main issues:
Data validation bottleneck: within the current workflow, only a small percentage of the observations are validated, to avoid overwhelming the volunteer experts who actively do this job through the collaborative web tools. There is consequently a need for smarter task assignment and recommendation mechanisms that better balance the collaborative workload across all users and improve serendipity.
Bias of the produced data: the temporal and geographical distribution of the observations is highly correlated with human activity. High densities of observations are determined more by population density and human behavior than by plant density. This issue inevitably arises in any participatory sensing system, but when the objective is to monitor noise nuisance or air quality, the concentration of observations in cities is less critical. There is therefore a need for new data analytics methods that compensate for the bias through long-term statistics and the use of contextual information.
This joint project with the Kerdata team, in the context of the Joint Inria – Microsoft Research Centre, is on advanced data storage and processing for cloud workflows. The project addresses the problem of advanced data storage and processing to support scientific workflows in the cloud. The goal is to design and implement a framework for the efficient processing of scientific workflows in clouds. The validation will be performed using synthetic benchmarks and real-life applications from bioinformatics: first on the Grid5000 platform in a preliminary phase, then on the Microsoft Azure cloud environment.
This project aims at developing new data mining techniques for P2P networks. The main goal is to preserve data privacy while achieving good performance of the analysis processes. More precisely, each participant in the P2P network has its own individual data (e.g. results of experiments for a scientific partner) and all participants would like to acquire knowledge computed on the whole dataset (i.e., the union of all the individual data on the peers). Meanwhile, participants want a guarantee that no other participant will be able to see their data. The P2P protocol we have developed is now able to extract knowledge from the whole set of distributed data, while avoiding centralization and guaranteeing data privacy for all peers. The work is currently the subject of a patent between EDF and Inria (patent number in progress).
Triton is a new common lab (i-lab) created between Zenith and Beepeers (beepeers.
URL: http://
We are participating in the Laboratory of Excellence (labex) NUMEV (Digital and Hardware Solutions, Modelling for the Environment and Life Sciences) headed by University of Montpellier 2 in partnership with CNRS, University of Montpellier 1, and Inria. NUMEV seeks to harmonize the approaches of hard sciences and life and environmental sciences in order to pave the way for an emerging interdisciplinary group with an international profile. The NUMEV project is decomposed into four complementary research themes: modeling; algorithms and computation; scientific data (processing, integration, security); and model-systems and measurements. Florent Masseglia co-heads (with Pascal Poncelet) the theme on scientific data.
URL: http://
IBC is a 5-year project with a funding of 2 Meuros by the MENRT (“Investissements d'Avenir” program) to develop innovative methods and software to integrate and analyze biological data at large scale in health, agronomy and environment. Patrick Valduriez heads the workpackage on integration of biological data and knowledge.
The Datascale project is a “projet investissements d’avenir” on big data with Bull (leader), CEA, ActiveEon SAS, Armadillo, Twenga, IPGP, Xedix and Inria (Zenith) . The goal of the project is to develop the essential technologies for big data, including efficient data management, software architecture and database architecture, and demonstrate their scalability with representative applications. In this project, the Zenith team works on data mining with Hadoop MapReduce.
The X-data project is a “projet investissements d’avenir” on big data with Data Publica (leader), Orange, La Poste, EDF, Cinequant, Hurence and Inria (Indes, Planete and Zenith) . The goal of the project is to develop a big data platform with various tools and services to integrate open data and partners’ private data for analyzing the location, density and consumption of individuals and organizations in terms of energy and services. In this project, the Zenith team heads the workpackage on data integration.
The Pl@ntNet project http://
This CIFRE contract with INA funds a 3-year PhD (Valentin Leveau), which addresses research challenges related to large-scale supervised content-based retrieval, notably in distributed environments.
This project deals with the problems of big data in the context of life science, where masses of data are being produced, e.g. by Next Generation Sequencing technologies or plant phenotyping platforms. In this project, Zenith addresses the specific problems of large-scale data analysis and data sharing.
Project title: A Coherent and Rich Platform as a Service with a Common Programming Model
Instrument: Integrated Project
Duration: 2013 - 2016
Total funding: 5 Meuros (Zenith: 500Keuros)
Coordinator: U. Madrid, Spain
Partner: FORTH (Greece), ICCS (Greece), INESC (Portugal) and the companies MonetDB (Netherlands), QuartetFS (France), Sparsity (Spain), Neurocom (Greece), Portugal Telecom (Portugal).
Inria contact: Patrick Valduriez
Accessing and managing large amounts of data is becoming a major obstacle to developing new cloud applications and services with correct semantics, requiring tremendous programming effort and expertise. CoherentPaaS addresses this issue in the cloud PaaS landscape by developing a PaaS that incorporates a rich and diverse set of cloud data management technologies, including NoSQL data stores, such as key-value data stores and graph databases; SQL data stores, such as in-memory and column-oriented databases; hybrid systems, such as SQL engines on top of key-value data stores; and complex event processing data management systems. It uses a common query language to unify the programming models of all systems under a single paradigm, and provides holistic coherence across data stores using a scalable transactional management system. CoherentPaaS will dramatically reduce the effort required to build cloud applications that use multiple cloud data management technologies, and improve their quality, via a single query language, a uniform programming model, and ACID-based global transactional semantics. CoherentPaaS will design and build a working prototype and validate the proposed technology with real-life use cases. In this project, Zenith is in charge of designing an SQL-like query language to query multiple databases (SQL, NoSQL) in a cloud, and of implementing a compiler/optimizer and query engine for that language.
Title: MUltiSite Cloud (MUSIC) data management
Inria principal investigator: Esther Pacitti
International Partner (Institution - Laboratory - Researcher):
Laboratorio Nacional de Computaçao Cientifica, Petropolis (Brazil) - Fabio Porto
Universidade Federal do Rio de Janeiro (Brazil) - Alvaro Coutinho and Marta Mattoso
Universidade Federal Fluminense, Niteroi (Brazil) - Daniel Oliveira
Centro Federal de Educação Tecnológica, Rio de Janeiro (Brazil) - Eduardo Ogasawara
Duration: 2014 - 2016
See also: https://
The cloud has become a good match for managing big data since it provides unlimited computing, storage and network resources on demand. By centralizing all data in a large-scale data-center, the cloud significantly simplifies the task of system administration. But for scientific data, where different organizations may have their own data-centers, a distributed (multisite) cloud model, where each site is visible from the outside, is needed. The main objective of this research and scientific collaboration is to develop a multisite cloud architecture for managing and analyzing scientific data, including support for heterogeneous data, distributed scientific workflows, and complex big data analysis. The resulting architecture will enable scalable data management infrastructures that can host a variety of scientific applications benefiting from computing, storage, and networking resources that span multiple data-centers.
Title: A hybrid P2P/cloud for big data
Inria principal investigator: Patrick Valduriez
International Partner :
University of California at Santa Barbara (USA) - Amr El Abbadi and Divy Agrawal
Duration: 2013 - 2015
See also: https://
The main objective of this research and scientific collaboration is to develop a hybrid architecture of a computational platform that leverages the cloud computing and the P2P computing paradigms. The resulting architecture will enable scalable data management and data analysis infrastructures that can be used to host a variety of next-generation applications that benefit from computing, storage, and networking resources that exist not only at the network core (i.e., data-centers) but also at the network edge (i.e., machines at the user level as well as machines available in CDNs – content distribution networks hosted in ISPs).
We have regular scientific relationships with research laboratories in
North America: Univ. of Waterloo (Tamer Özsu), Mc Gill, Montreal (Bettina Kemme).
Asia: National Univ. of Singapore (Beng Chin Ooi, Stéphane Bressan), Wonkwang University, Korea (Kwangjin Park)
Europe: Univ. of Amsterdam (Naser Ayat, Hamideh Afsarmanesh), Univ. of Madrid (Ricardo Jiménez-Periz), UPC Barcelona (Josep Lluis Larriba Pey, Victor Munoz)
North Africa: Univ. of Tunis (Sadok Ben-Yahia)
The Bigdatanet associated team is part of the Inria@SiliconValley lab.
We are involved in the following international actions:
CNPq-Inria project Hoscar (HPC and data management, 2012-2015) with LNCC (Fabio Porto), UFC, UFRGS (Philippe Navaux), UFRJ (Alvaro Coutinho, Marta Mattoso) to work on data management in high performance computing environments.
Ruiming Tang (National University of Singapore) gave a seminar on “Quality and Price of Data” in January.
Xiao Bai (Yahoo Labs Barcelona) gave a seminar on “Improving the Efficiency of Multi-site Web Search Engines” in January.
Philippe Bonnet (IT University of Copenhagen) gave a seminar on “CLyDE Mid-Flight: What we have learnt so far about the SSD-Based IO Stack” in May.
Antoine Chambille and Romain Colle (QuartetFS, Paris) gave a seminar on “In-Memory Analytics: Accelerating Business Performance” in June.
Divy Agrawal and Amr El Abbadi (UCSB, USA) gave keynote talks on “Emerging Technologies for Big Data Management and Analytics” and “Consistent, Elastic and Fault-Tolerant Management of Big Data in the Cloud”, respectively, in the Mastodons International Workshop on “Big Data Management and Crowd Sourcing towards Scientific Data” in Montpellier in June.
Bettina Kemme (McGill Univ., Canada) gave a seminar on “Multiplayer Games: a complex application in need for scalable replica management” in December.
Sihem Amer-Yahia (LIG) gave a seminar on “Task Assignment Optimization in Crowdsourcing” in December.
Patrick Valduriez visited the Inria-Chile center in Santiago in October, where he gave several talks.
Mohamed Reda Bouadjenek visited UCSB in November-December, in the context of the Bigdatanet associated team.
Participation in the editorial board of scientific journals:
VLDB Journal: P. Valduriez.
Journal of Transactions on Large Scale Data and Knowledge Centered Systems, R. Akbarinia.
Distributed and Parallel Databases, Kluwer Academic Publishers: E. Pacitti, P. Valduriez.
Internet and Databases: Web Information Systems, Kluwer Academic Publishers: P. Valduriez.
Journal of Information and Data Management, Brazilian Computer Society Special Interest Group on Databases: P. Valduriez.
Book series “Data Centric Systems and Applications” (Springer): P. Valduriez.
Ingénierie des Systèmes d'Information, Hermès: P. Valduriez.
Journal of Data Semantics (Springer): S. Cohen-Boulakia
Participation to the organization of conferences and workshops:
Florent Masseglia was vice-chair of the international conference on data mining (ICDM 2014).
Alexis Joly was chair of the LifeCLEF 2014 workshop (within CLEF 2014 conference) and co-organized the international workshop on Environmental Multimedia Retrieval (EMR 2014).
Esther Pacitti organized the Mastodons International Workshop on “Big Data Management and Crowd Sourcing towards Scientific Data” in Montpellier in June.
Patrick Valduriez co-organized (with Prof. Masaru Kitsuregawa, general director of NII, Japan) a workshop on big data sponsored by the French Embassy in Tokyo in November.
Participation in conference program committees :
ACM SIGMOD Conf. 2014: R. Akbarinia, P. Valduriez (area chair); 2015: S. Cohen-Boulakia
ADBIS East-European Conference on Advances in Databases and Information Systems: P. Valduriez (PC chair)
VLDB 2015: P. Valduriez (sponsor co-chair)
ACM Symposium On Applied Computing (ACM SAC, Data Stream track), 2014: F. Masseglia
Conférence Internationale Francophone sur l'Extraction et la Gestion de Connaissance (EGC), 2014: F. Masseglia
IEEE Int. Conf. on Data Mining (ICDM), 2014: F. Masseglia (area chair)
IEEE Ph.D. forum at ICDM, 2014: F. Masseglia
Int. Conf. on Data Science and Advanced Analytics (DSAA), 2014: F. Masseglia
Int. Conf. on Data Management Technologies and Applications (DATA), 2014: F. Masseglia
Int. Conf. on Extending DataBase Technologies (EDBT), 2014: R. Akbarinia, A. Joly, F. Masseglia; 2015: E. Pacitti
IEEE Int. Conf. on Data Engineering (ICDE) 2014: P. Valduriez; 2015: S. Cohen-Boulakia
Data Integration in the Life Sciences (DILS) 2014: S. Cohen-Boulakia
Business Process Management (BPM) 2015: S. Cohen-Boulakia
ICDE 2015: Database and Semantic Web Workshop (DESWeb): P. Valduriez
International Conference and Labs of the Evaluation Forum (CLEF), 2014: A. Joly
Int. Workshop on Multimedia Analysis for Environmental Data (MAED), 2014: A. Joly
Int. Workshop on Human Centered Event Understanding from Multimedia (HuEvent), 2014: A. Joly
Int. Workshop Computer Vision for Analysis of Underwater Imagery (CVAU), 2014: A. Joly
Workshop on Theory and Practice of Provenance (TAPP) 2014 and 2015: S. Cohen-Boulakia
Sigmod Workshop on scalable workflow enactment engines and technologies (SWEET) 2014: S. Cohen-Boulakia
ICDT workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR) 2015: S. Cohen-Boulakia
Reviewing in international journals :
Knowledge and Information Systems (KAIS): F. Masseglia
Annals of Mathematics and Artificial Intelligence: F. Masseglia
Distributed and Parallel Databases: F. Masseglia
VLDB Journal: A. Joly
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI): A. Joly
IEEE Transactions on Multimedia: A. Joly
Journal of Mathematical Imaging and Vision (JMIV): A. Joly
Transactions on Knowledge and Data Engineering: A. Joly
EURASIP Journal on Image and Video Processing: A. Joly
Computer Vision and Image Understanding Journal: A. Joly
ACM Trans. on Database Systems: E. Pacitti
Other activities (national):
Zenith participates in the Hemera and Grid'5000 communities by: contributing to the definition of "Storage 5000" (we help define the financial model for teams to have their data in Grid5000); participating in Hemera meetings (Florent Masseglia gave a talk on "Pattern Mining from Big Data" at the meeting of Sept. 18 in Nantes); and contributing to the prospective document of Grid5000.
Patrick Valduriez co-organized the CEA-EDF-Inria Summer School on Big Data Analytics, June 16-20, Cadarache, France, and gave a tutorial on “Distributed and Parallel Data Processing”.
Patrick Valduriez is the scientific manager for the Latin America zone at Inria Direction des Relations Internationales (DRI), Member of the Scientific Committee at Agence Nationale de la Recherche (ANR) - Défi 7 Information and communication society and Member of the Scientific Committee of the BDA conference.
Alexis Joly was keynote speaker of the scientific days of the University of Lyon 1 (2014).
Esther Pacitti gave an invited talk on “Profile Diversity for Query Processing using User Recommandations” at the Mastodons CrEDIBLE Workshop, Sophia-Antipolis, 9 October, and at the Seminar on Methods and Tools for Open Data, INRA, Montpellier, 18 December.
Other activities (international):
Florent Masseglia gave a talk on “Pattern Mining from Big Data”, to the French-Japanese workshop on Big Data in Tokyo in November.
Florent Masseglia was chair of the panel on "Privacy and Big Data", for the French-Japanese workshop on Big Data at Tokyo in November.
Patrick Valduriez gave several talks at international events: “CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language” at the Inria-Technicolor Workshop on Storage and Processing of Big Data, Rennes, December 3-4, 2014, and at the Workshop on DB Consistency in the Cloud, LIP6, Paris, September 15-16, 2014; “Cloud & Big Data: opportunities and risks” at the OECD-ECLAC Workshop “New Approaches to Economic Challenges”, Paris, May 19, 2014; and “Indexing and Processing Big Data” at the Colloquium on Indexing for Scientific Big Data, Paris, January 15, 2014.
Esther Pacitti gave talks on search and recommendation at the MUSIC workshops (COPPE-UFRJ, Rio de Janeiro, August 8, 2014, and LNCC, Petropolis, Rio de Janeiro, October 20, 2014), and at the GDRI Workshop on Innovative Research Issues on Web Science, Toulouse, September 10-12.
Sarah Cohen-Boulakia gave invited talks at the University of Pennsylvania in April and at Humboldt University, Berlin, in December.
Most permanent members of Zenith teach at the Licence and Master degree levels at UM2.
Reza Akbarinia:
Master Research: Large scale data management, 6h, level M2, Faculty of Science, UM2
Licence: Computing Tools, 54h, Level L3, Faculty of Science, UM2
Florent Masseglia:
Master Research: Large scale data management, 3h, level M2, Faculty of Science, UM2
Summer School (EDF/Inria) on "Big Data Analytics": Pattern Mining from Big Data, 3h courses and 3h practical work, CEA, Cadarache.
Science popularization: Jean-Philippe Bernard, Ph.D. student from the doctoral school I2S, is taking a 30h doctoral module under Florent Masseglia's supervision.
Esther Pacitti:
IG3: Database design, physical organization, 54h, level L3, Polytech'Montpellier, UM2
IG4: Networks, 42h, level M1, Polytech' Montpellier, UM2
IG4: Object-relational databases, 32h, level M1, Polytech' Montpellier, UM2
IG5: Distributed systems, virtualization, 27h, level M2, Polytech' Montpellier, UM2
Industry internship committee, 50h, level M2, Polytech' Montpellier
Master Research: Large scale data management, 4,5h, level M2, Faculty of Science, UM2
Didier Parigot:
Master Research: Large scale data management, 6h, level M2, Faculty of Science, UM2
Patrick Valduriez:
Master Research: Large scale data management, 12h, level M2, Faculty of Science, UM2
Professional: Distributed Information Systems, 50h, level M2, Capgemini Institut
Professional: XML, 40h, level M2, Orsys Formation
Alexis Joly:
Master Research: Large scale data management, 6h, level M2, Faculty of Science, UM2
PhD in progress: Mehdi Zitouni, Closed Pattern Mining in a Massively Distributed Environment, started Sept. 2014, Univ. Tunis, Advisor: Florent Masseglia, co-advisor: Reza Akbarinia
PhD in progress: Ji Liu, Scientific Workflows in Multisite Cloud, started Oct. 2013, Univ. Montpellier 2, Advisors: Esther Pacitti and Patrick Valduriez
PhD in progress: Saber Salah, Optimizing a Cloud for Data Mining Primitives, started Nov. 2012, Univ. Montpellier 2, Advisor: Florent Masseglia, co-advisor: Reza Akbarinia
PhD in progress: Valentin Leveau, Supervised content-based information retrieval in big multimedia data, started April 2013, Univ. Montpellier 2, Advisor: Patrick Valduriez, co-advisors: Alexis Joly and Olivier Buisson
PhD in progress: Djamel-Edine Yagoubi, Indexing Time Series in a Massively Distributed Environment, started October 2014, Univ. Montpellier 2, Advisors: Florent Masseglia and Patrick Valduriez, co-advisor: Reza Akbarinia
Members of the team participated in the following Ph.D. committees:
F. Masseglia: Andres Moreno (University of Nice/Sophia-Antipolis, reviewer).
R. Akbarinia: Maximilien Servajean (UM2), Jesús Camacho Rodriguez (Univ. Paris-Sud).
E. Pacitti: Julien Gaillard (Univ. Avignon); Lourdes Angelica Martinez Medina (Univ. Grenoble).
P. Valduriez: Li Feng (National University of Singapore, reviewer), Fan Quinfeng (UVSQ, reviewer), Pierpaolo Cincilla (UPMC Paris 6, reviewer), Radu Tudoran (ENS Rennes, chair), Laurent Amsaleg (HDR, U. Rennes 1).
Today, one of the main questions in science popularization and computer science teaching at school in France is: "how to scale?". Some of our recent actions in this domain are mainly oriented towards this goal of scaling.
F. Masseglia has coordinated a national network of colleagues for promoting code learning. This action gave rise to Inria's "KidEtCode" studios, and the involved colleagues benefited from two training sessions on the subject. After several studios were successfully organized by colleagues in the network, F. Masseglia wrote a series of interviews in which they give feedback on how to set up a KidEtCode studio, find partners and provide relevant content.
F. Masseglia gave a 3-day training session to elementary school teachers. This event, called “Graines de sciences” (Ecole de Physique des Houches,
http://
Following this autumn school for teachers, "La main à la pâte" is now working with Inria (Gilles Dowek, Pierre-Yves Oudeyer, Florent Masseglia and Didier Roy), "France-IOI" and the University of Lorraine in order to write a school book on computer science teaching.
Zenith participated in the following events in Montpellier, thereby contributing to the effort of relaunching science popularization activities in the city:
F. Masseglia co-organized and co-animated Inria's stand at "La fête de la science" (Montpellier), held by Genopolys (a science village).
F. Masseglia organized and animated several "Kid&Code" studios in the greater metropolitan area of Montpellier, involving the network of media libraries and the network of extracurricular activities. He also proposed a two-session training for media library activity leaders, the goal being to accompany media libraries in the region of Montpellier and help them until they are autonomous in this activity.
F. Masseglia is a member of the steering committee, and a contributor, of interstices (https://
F. Masseglia is a member of the management board of "Les Petits Débrouillards" in Languedoc-Roussillon. He is also responsible for school visits at the Lirmm laboratory.
A. Joly presented the Pl@ntNet project and the associated iPhone app at several events, including "Futur en Seine 2014", where it received the collaborative research prize, "les rencontres Inria/industries", "LeWeb 2014", etc. Many news articles, blogs, TV news reports and tweets also covered Pl@ntNet (searching for "Pl@ntNet" on Google is the best way to find most of them).
P. Valduriez gave a talk on “Cloud Big Data” at Université du Tiers Temps, Montpellier.