Big data analytics
The amount of digitally produced data is growing at an exponential
rate.
Having a
dedicated programming model and runtime, such as Hadoop-MapReduce, has
proved very useful for building efficient big data mining and analysis applications, albeit
for very static environments. However, when not only the environment
is dynamic (node sharing, failures, etc.) but the data are too
(variations in popularity, arrival rate, etc.), the problem
becomes much more complex. This domain is thus a very good
candidate application field for our work.
More precisely, we plan to
contribute at the deployment level, at the runtime level, and at the
level of the analytics programming model offered to the end user. We have already worked on closely related
topics with EventCloud, a distributed P2P storage and publish/subscribe system for Semantic Web data.
However, expressing interest in particular data through simple or even more complex subscriptions (complex event processing, CEP) is only a first
step in data analytics. Going further requires the full expressiveness of a programming language to describe how
to mine real-time data streams, aggregate intermediate analytics results, combine them with past data when relevant, etc.
We intend to extend this effort on extracting meaningful information by also building
tighter collaborations with groups specialized in data mining algorithms (e.g. the Mind team at I3S).
We think that the approach advocated in Scale is particularly
well suited to the programming and support of analytics. Indeed, the mix
of computational aspects and large amounts of data makes the
computation of analytics an ideal target for our programming
paradigms. We aim to illustrate the effectiveness of our approach by
experimenting with different analytics computations, but we will put
a particular focus on the case of data streams, where the analysis
consists of chains (or even cyclic graphs) of parallel and distributed
operators. These operators can naturally be expressed as coarse-grained
compositions of fine-grained parallel entities, with both
granularity levels featuring autonomic adaptation.
The underlying execution platform that supports these operators
must likewise feature autonomic adaptation in order to cope with an
unstable and heterogeneous execution environment. Here, autonomic
adaptation is all the more crucial because the programmer of analytics is not
expected to be an expert in distributed systems.
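As a purely illustrative sketch of the operator structure described above, the following Python fragment models a coarse-grained stream operator as a pool of fine-grained parallel workers, with a toy adaptation rule that resizes the pool according to the observed batch size. All names (Operator, adapt, items_per_worker, run_chain) are hypothetical and chosen for this example; they are not part of the Scale runtime or of any existing API, and a real autonomic manager would rely on richer monitoring than batch size alone.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class Operator:
    """Coarse-grained stream operator built from fine-grained parallel workers."""

    def __init__(self, fn: Callable[[int], int], min_workers: int = 1,
                 max_workers: int = 8, items_per_worker: int = 4) -> None:
        self.fn = fn
        self.min_workers = min_workers
        self.max_workers = max_workers
        # Hypothetical adaptation threshold: target load per worker.
        self.items_per_worker = items_per_worker
        self.workers = min_workers

    def adapt(self, batch_size: int) -> None:
        """Toy autonomic rule: size the worker pool to the observed load."""
        wanted = -(-batch_size // self.items_per_worker)  # ceiling division
        self.workers = max(self.min_workers, min(self.max_workers, wanted))

    def process(self, batch: List[int]) -> List[int]:
        """Adapt to the batch, then apply fn to every item in parallel."""
        self.adapt(len(batch))
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map preserves input order in its results.
            return list(pool.map(self.fn, batch))


def run_chain(operators: List[Operator],
              batches: Iterable[List[int]]) -> List[List[int]]:
    """Push each incoming batch through the chain of operators in order."""
    results = []
    for batch in batches:
        for op in operators:
            batch = op.process(batch)
        results.append(batch)
    return results
```

For instance, run_chain([Operator(lambda x: x * 2), Operator(lambda x: x + 1)], batches) applies two operators in sequence to each batch, and each operator independently chooses its own degree of parallelism. The sketch covers only linear chains; cyclic operator graphs would need an explicit routing layer.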
Overall, this second target application domain should illustrate the
effectiveness of our runtime platform and of our methodology for
dynamic and autonomic adaptation.