Section: New Software and Platforms


Keywords: Hadoop - RDF - SPARQL

Scientific Description

SPARQL is the W3C standard language for querying data expressed in RDF (Resource Description Framework). The growing volume of available RDF data creates a strong need for, and research interest in, efficient and scalable distributed SPARQL query evaluators.

In this context, we propose and share RDFHive: a simple implementation of a distributed RDF datastore benefiting from Apache Hive. RDFHive is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. RDFHive relies on a translation of SPARQL queries into SQL queries that Hive is able to evaluate.

Technically, RDFHive evaluates SPARQL queries directly, with no preprocessing step: Hive sees an RDF triple file as a three-column table. The bash translator therefore simply translates SPARQL queries according to this scheme. This method has two advantages. First, creating a database is very fast. Second, since the upfront investment is light, RDFHive is an interesting tool to evaluate a few SPARQL queries at once.
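
To make the translation scheme concrete, the following Python sketch shows one way a SPARQL basic graph pattern could be rewritten as a self-join over a three-column table. It is an illustration only and not RDFHive's actual translator, which is written in bash. The function name `bgp_to_sql`, the table name `triples`, and the column names `s`, `p`, `o` are assumptions introduced here, and the generated text is generic SQL that may need adapting to Hive's dialect.

```python
def bgp_to_sql(patterns, table="triples"):
    """Translate a list of (subject, predicate, object) triple patterns
    into one SQL query over a three-column table.

    Terms starting with '?' are variables; every other term is treated
    as a constant. Names and layout are hypothetical (see lead-in).
    """
    columns = ("s", "p", "o")
    first_seen = {}   # variable -> column where it first occurs, e.g. "t0.s"
    conditions = []   # equality conditions for constants and shared variables

    for i, pattern in enumerate(patterns):
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # A repeated variable becomes a join condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                if "'" in term:
                    # Quoting rules are out of scope for this sketch.
                    raise ValueError("constants with quotes are not supported")
                conditions.append(f"{ref} = '{term}'")

    # Project each variable once, named after the variable without '?'.
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    # One alias of the triple table per triple pattern.
    sources = ", ".join(f"{table} t{i}" for i in range(len(patterns)))

    sql = f"SELECT {select or '*'} FROM {sources}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


if __name__ == "__main__":
    # Two patterns sharing ?y: who knows someone, and that person's name.
    print(bgp_to_sql([("?x", "<knows>", "?y"), ("?y", "<name>", "?n")]))
```

Each triple pattern contributes one alias of the same table, so a query with n patterns becomes an n-way self-join. This is consistent with the light upfront cost described above, since no index or partitioning is built before the first query runs.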