Query Workload-Driven Schema Optimization For Processing Large RDF Datasets

Valiyev, Farid

dc.contributor.advisor	Ragab, Mohamed, supervisor
dc.contributor.advisor	Tommasini, Riccardo, supervisor
dc.contributor.advisor	Nolte, Alexander, supervisor
dc.contributor.author	Valiyev, Farid
dc.contributor.other	Tartu Ülikool. Loodus- ja täppisteaduste valdkond	et
dc.contributor.other	Tartu Ülikool. Arvutiteaduse instituut	et
dc.date.accessioned	2023-10-24T11:41:30Z
dc.date.available	2023-10-24T11:41:30Z
dc.date.issued	2023
dc.description.abstract	Data are not only increasing in volume but also becoming more and more interconnected. In many areas of daily life, such as social media, computational biology and protein networks, and telecommunications, graph data models are the most natural, easy-to-understand, and versatile abstraction for representing the world’s structured knowledge. In fact, information extracted via natural language processing and computer vision is currently represented mostly as Knowledge Graphs (KGs). KGs are an efficient means to represent, integrate, and connect data from several heterogeneous data sources, and these applications have led to a surge in their popularity. However, this popularity brings computational challenges because KGs are growing to massive volumes. Specifically, many applications use the standard Resource Description Framework (RDF) graph data model to represent, share, and integrate data on the web. Managing RDF KGs at scale has therefore become a central problem for the Semantic Web (SW) community. Native graph databases (e.g., Apache Jena, RDF-3X, and Virtuoso) fall short of managing and processing large RDF datasets because of their centralized computational paradigm, i.e., they cannot scale out. Thus, the SW community has started to investigate relational Big Data (BD) frameworks to harness their scalability and efficiency. Relational systems owe much of their performance to sophisticated optimizers that leverage the simplicity and maturity of the relational model and relational algebra. Despite the flexibility of relational solutions, the schemaless structure of RDF graphs makes it challenging to store and manage them in relational schemas. The state of the art shows that there is no “One-Size-Fits-All” RDF relational schema that fits all query workloads. 
In particular, a different RDF relational schema wins by a large margin for each query type, and the winner in one query family may unexpectedly perform the worst in another. In this thesis, we argue that combining multiple RDF relational schemas into a hybrid one gives BD systems better performance when querying large KGs. Nevertheless, designing hybrid schema solutions for schemaless KGs requires huge data engineering effort and tailored solutions. To this end, this thesis proposes algorithms that automatically design a hybrid RDF relational schema that adapts to a query workload covering a wide range of query types, without ignoring loading times and storage overheads. In particular, we approach this goal via data profiling along with query profiling, seeking better data localization and placing relevant data that are frequently queried together in the same relations. Our approach reaches an optimal hybrid schema that considers both the underlying data relationships and the query workloads.	et
dc.identifier.uri	https://hdl.handle.net/10062/93711
dc.language.iso	eng	et
dc.publisher	Tartu Ülikool	et
dc.rights	openAccess	et
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Large RDF Graphs	et
dc.subject	SPARQL	et
dc.subject	Spark-SQL	et
dc.subject	RDF Relational Schema	et
dc.subject	Workload Driven	et
dc.subject.other	master's theses	et
dc.subject.other	informaatika	et
dc.subject.other	infotehnoloogia	et
dc.subject.other	informatics	et
dc.subject.other	infotechnology	et
dc.title	Query Workload-Driven Schema Optimization For Processing Large RDF Datasets	et
dc.type	Thesis	et

Files

Original bundle

Name: Valiyev_MSc_CS_2023.pdf
Size: 1.3 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

LTAT magistritööd – Master's theses