Section: Scientific Foundations

Managing massive unstructured data under heavy concurrency on large-scale distributed infrastructures

Massive unstructured data: BLOBs

Studies show that more than 80% [53] of the data in circulation globally is unstructured. Moreover, data sizes are increasing dramatically: in common scenarios, some production applications gather more than 1 TB of data per week (e.g., medical experiments [65]). Finally, on Post-Petascale HPC machines, the use of huge storage objects is currently being considered as a promising alternative to today's dominant approaches to data management: those approaches rely on very large numbers of small files, whereas huge storage objects reduce the corresponding metadata overhead of the file system. Such huge unstructured data are stored as binary large objects (BLOBs) that may be continuously updated by applications. However, traditional databases and file systems can hardly cope efficiently with BLOBs that grow to huge sizes.
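The access pattern implied above (fine-grained reads and writes at arbitrary offsets within one huge object, rather than across many small files) can be illustrated with a minimal in-memory sketch. The class and method names here are illustrative assumptions, not the interface of any particular system:

```python
class Blob:
    """Toy single-node BLOB: one flat byte sequence accessed by offset.
    A real BLOB store would distribute the bytes across many nodes."""

    def __init__(self):
        self._data = bytearray()

    def write(self, offset: int, buf: bytes) -> None:
        """Overwrite bytes starting at `offset`, zero-padding if the
        write extends past the current end of the BLOB."""
        end = offset + len(buf)
        if end > len(self._data):
            self._data.extend(b"\x00" * (end - len(self._data)))
        self._data[offset:end] = buf

    def append(self, buf: bytes) -> None:
        """Append at the current end (the common log-style update)."""
        self.write(len(self._data), buf)

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self._data[offset:offset + size])

    def size(self) -> int:
        return len(self._data)
```

The key point is that all operations address one object by offset, so a file system managing millions of such pieces as separate files (with per-file metadata) is replaced by offset arithmetic inside a single BLOB.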

Scalable processing of massive data: heavy access concurrency

To address the scalability issue, specialized abstractions like MapReduce [47] and Pig-Latin [63] propose high-level data processing frameworks intended to hide the details of parallelization from the user. Such frameworks are implemented on top of huge object storage platforms and target high performance by optimizing the parallel execution of the computation. This leads to heavy access concurrency on the BLOBs, hence the need for the storage layer to offer support in this regard. Parallel and distributed file systems also consider using objects for low-level storage (see next subsection; [48], [69], [51]). In other application areas, such as high-energy physics, multimedia processing [46], or astronomy, huge BLOBs need to be accessed concurrently at the highest-level application layers directly.
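The programming model these frameworks expose can be sketched in a few lines. The sequential simulation below shows the map, shuffle (group-by-key), and reduce phases that a real framework would run in parallel over BLOB fragments; the function names are ours, not the API of any specific framework:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Sequential sketch of the MapReduce model: a real framework runs
    the map and reduce phases in parallel on distributed data."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group by key
    # Reduce phase: one reducer invocation per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count example.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

result = map_reduce(["a b a", "b c"], mapper, reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

Because many mappers read (and many reducers write) fragments of the same large input and output objects at once, the storage layer sees exactly the heavy access concurrency described above.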

Versioning

When addressing the problem of storing and efficiently accessing very large unstructured data objects [60], [65] in a distributed environment, a challenging case is the one where data is mutable and potentially accessed by a very large number of concurrent, distributed processes. In this context, versioning is an important feature. Not only does it allow rolling back data changes when desired, but it also enables cheap branching (possibly recursively): the same computation may proceed independently on different versions of the BLOB. Given that objects are under constant heavy access concurrency, versioning should obviously not impact access performance to the object significantly. On the other hand, versioning increases storage space usage, which becomes a major concern when the data size itself is huge. Versioning efficiency thus refers both to access performance under heavy load and to a reasonably low storage space overhead.
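One common way to keep both rollback and branching cheap is copy-on-write over fixed-size pages: each version is an immutable list of page references, and a write copies only the pages it touches, so old versions and branches share all unmodified storage. The toy sketch below (page size, names, and structure are our assumptions, not the mechanism of any particular system) makes that sharing visible:

```python
PAGE = 4  # deliberately tiny page size so page sharing is visible

class VersionedBlob:
    """Each version is an immutable list of pages; a write copies only
    the pages it touches, so versions share all unmodified storage and
    branching is just writing twice against the same base version."""

    def __init__(self):
        self.versions = {0: []}   # version id -> list of bytes pages
        self._next = 1

    def write(self, version: int, offset: int, buf: bytes) -> int:
        pages = list(self.versions[version])  # copies references, not data
        end = offset + len(buf)
        while len(pages) * PAGE < end:
            pages.append(bytes(PAGE))         # zero-filled new page
        first, last = offset // PAGE, (end - 1) // PAGE
        for p in range(first, last + 1):
            lo, hi = max(offset, p * PAGE), min(end, (p + 1) * PAGE)
            page = bytearray(pages[p])        # copy-on-write: touched pages only
            page[lo - p * PAGE:hi - p * PAGE] = buf[lo - offset:hi - offset]
            pages[p] = bytes(page)
        vid, self._next = self._next, self._next + 1
        self.versions[vid] = pages            # old version stays readable
        return vid

    def read(self, version: int, offset: int, size: int) -> bytes:
        data = b"".join(self.versions[version])
        return data[offset:offset + size]
```

Two writes against the same base version yield two independent branches, and the storage overhead of a version is proportional to the pages it modified, not to the BLOB size, which is what keeps the space overhead of versioning acceptable for huge data.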