Large RDF Graph Processing on Top of Spark

Eyvazov, Sadig

dc.contributor.advisor	Ragab, Mohammed, supervisor
dc.contributor.advisor	Tommasini, Riccardo, Dr., supervisor
dc.contributor.author	Eyvazov, Sadig
dc.contributor.other	Tartu Ülikool. Loodus- ja täppisteaduste valdkond	et
dc.contributor.other	Tartu Ülikool. Arvutiteaduse instituut	et
dc.date.accessioned	2023-09-08T07:28:14Z
dc.date.available	2023-09-08T07:28:14Z
dc.date.issued	2021
dc.description.abstract	In recent years, we have witnessed rapid, largely uncontrolled growth in the data generated by machines and humans. Big Data is a term used to indicate data-related challenges. Although several challenges have been identified for big data, the main ones remain volume, velocity, and variety. Volume relates to the large quantity of data. Velocity relates to the high rates at which data is generated and processed. Last but not least, variety relates to the presence of multiple data formats. Although there are many solutions to the data variety issue, the most popular one is the RDF (Resource Description Framework) data model. RDF is a W3C standard for the Semantic Web, and many web applications are built on top of the RDF data model using the SPARQL query language. The continuous growth of RDF data therefore motivates investigating how to handle large RDF datasets in a distributed environment. Apache Spark is a modern, high-performance big data engine for processing vast amounts of data in a distributed environment. Big data systems like Apache Spark are not tailored to the RDF data model; however, they perform very well for large-scale relational data processing. Therefore, we implement SPARQL queries over RDF data using Spark-SQL. In this thesis, we use existing relational approaches for storing RDF data in Spark's DataFrame abstraction. We present a systematic performance evaluation of the Spark-SQL engine for processing SPARQL queries on the SP2Bench benchmark. In particular, we used three relevant relational schemas, two storage backends, and several file formats. We also applied three different partitioning techniques to see how they affect Spark-SQL query execution performance. Finally, a major contribution of this thesis is an advanced analysis of the experimental results and a discussion of the impact of each dimension (i.e., relational schema, partitioning technique, storage backend) on the performance of the query execution process in the distributed environment of Spark.	et
dc.identifier.uri	https://hdl.handle.net/10062/92013
dc.language.iso	eng	et
dc.publisher	Tartu Ülikool	et
dc.rights	openAccess	et
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Large RDF Graphs	et
dc.subject	SPARQL	et
dc.subject	Spark-SQL	et
dc.subject	RDF Relational Schema	et
dc.subject.other	master's theses	et
dc.subject.other	informatics	et
dc.subject.other	infotechnology	et
dc.title	Large RDF Graph Processing on Top of Spark	et
dc.type	Thesis	et

Files

Original bundle

Name: SadigEyvazov_MasterThesis_CS.pdf
Size: 1.31 MB
Format: Adobe Portable Document Format

Collections

LTAT magistritööd – Master's theses