Large RDF Graph Processing on Top of Spark
Date
2021
Authors
Publisher
Tartu Ülikool
Abstract
In recent years, we have witnessed an uncontrollable growth of data generated by
machines or humans. Big Data is a term used to indicate data-related challenges.
Although several challenges have been identified for big data, the main ones remain volume,
velocity, and variety. Volume is related to the large quantity of data. Velocity is
related to the high rates at which the data is generated and processed. Last but not
least, the variety is related to the presence of multiple data formats. Although there
are many solutions to handle the data variety issue, the most popular one is the RDF
(Resource Description Framework) data model. RDF is a W3C standard for the Semantic
Web, and many web applications are built on top of the RDF data model using
the SPARQL query language. Thus, the continuous growth of RDF data motivates investigating
how to handle large RDF datasets in a distributed environment. Apache Spark
is a modern, high-performance big data engine for processing vast amounts of data
in a distributed environment. Big data systems like Apache Spark are not tailored
for dealing with RDF data models; however, they offer excellent performance for
large-scale relational data processing. Therefore, we implement SPARQL queries
over RDF data using Spark-SQL.
In this thesis, we use existing relational approaches for storing RDF data in the Spark
DataFrame abstraction. We present a systematic performance evaluation of the
Spark-SQL engine for processing SPARQL queries on the SP2Bench benchmark. In
particular, we used three relevant relational schemes, two storage backends, and several
file formats. We have also applied three different partitioning techniques to see
how they affect the Spark-SQL query execution performance. Finally, a major contribution
of this thesis is an advanced analysis of experimental results and a discussion
about the impact of each dimension (i.e. relational schema, partitioning technique,
storage backend) on the performance of the query execution process in the distributed
environment of Spark.
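
To make the idea of answering SPARQL queries through a relational engine concrete, here is a minimal sketch. It assumes a single triples table with columns s, p, and o, which is one commonly used relational scheme for RDF; the abstract does not name the three schemes that were evaluated, so this choice is an assumption. Python's built-in sqlite3 module stands in for Spark-SQL so that the example runs without a cluster, and the sample triples, prefixes, and the run_query helper are hypothetical. Each SPARQL triple pattern becomes one self-join on the table.

```python
import sqlite3

# Hypothetical sample triples (subject, predicate, object); not from the thesis.
TRIPLES = [
    ("ex:article1", "rdf:type", "ex:Article"),
    ("ex:article1", "dc:title", "Graphs on Spark"),
    ("ex:article1", "dc:creator", "ex:alice"),
    ("ex:article2", "rdf:type", "ex:Article"),
    ("ex:article2", "dc:title", "Query Engines"),
    ("ex:alice", "foaf:name", "Alice"),
]

# SPARQL query being illustrated:
#   SELECT ?title ?name WHERE {
#     ?a rdf:type   ex:Article .
#     ?a dc:title   ?title .
#     ?a dc:creator ?c .
#     ?c foaf:name  ?name .
#   }
# Relational form: one alias of the triples table per triple pattern,
# joined on the shared variables (?a and ?c).
SQL = """
SELECT t2.o AS title, t4.o AS name
FROM triples t1
JOIN triples t2 ON t2.s = t1.s AND t2.p = 'dc:title'
JOIN triples t3 ON t3.s = t1.s AND t3.p = 'dc:creator'
JOIN triples t4 ON t4.s = t3.o AND t4.p = 'foaf:name'
WHERE t1.p = 'rdf:type' AND t1.o = 'ex:Article'
ORDER BY title
"""


def run_query(triples):
    """Load triples into an in-memory table and evaluate the translated query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
        conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
        return conn.execute(SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for title, name in run_query(TRIPLES):
        print(title, name)
```

In this sample only ex:article1 has a creator with a name, so ex:article2 does not appear in the result. The number of self-joins grows with the number of triple patterns, which is one reason the choice of relational schema and partitioning can matter for query performance.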
Keywords
Large RDF Graphs, SPARQL, Spark-SQL, RDF Relational Schema