Query Workload-Driven Schema Optimization For Processing Large RDF Datasets

dc.contributor.advisor: Ragab, Mohamed, supervisor
dc.contributor.advisor: Tommasini, Riccardo, supervisor
dc.contributor.advisor: Nolte, Alexander, supervisor
dc.contributor.author: Valiyev, Farid
dc.contributor.other: Tartu Ülikool. Loodus- ja täppisteaduste valdkond [et]
dc.contributor.other: Tartu Ülikool. Arvutiteaduse instituut [et]
dc.date.accessioned: 2023-10-24T11:41:30Z
dc.date.available: 2023-10-24T11:41:30Z
dc.date.issued: 2023
dc.description.abstract: Data are not only increasing in volume but also becoming more and more interconnected. In many areas of daily life, such as social media, computational biology and protein networks, and telecommunications, graph data models are the most natural, easy-to-understand, and versatile abstraction for representing the world's structured knowledge. In fact, information retrieved via natural language processing and computer vision is currently represented mostly by Knowledge Graphs (KGs). KGs are an efficient means to represent, integrate, and connect data from several heterogeneous data sources. These applications have led to a surge in the popularity of KGs. However, this growth brings computational challenges because KGs are reaching massive volumes. Specifically, several applications have used the standard Resource Description Framework (RDF) graph data model to represent, share, and integrate data on the web. Managing RDF KGs at scale has therefore become a central problem for the Semantic Web (SW) community. Native graph databases (e.g., Apache Jena, RDF-3X, and Virtuoso) fall short of managing and processing large RDF datasets because of their centralized computational paradigm, i.e., they cannot scale out. Thus, the SW community has started to investigate relational Big Data (BD) frameworks to harness their scalability and efficiency. Relational systems owe much of their performance to sophisticated optimizers that leverage the simplicity and maturity of the relational model and relational algebra. Despite the flexibility of relational solutions, the flexible (i.e., schemaless) structure of RDF graphs makes it challenging to store and manage them in relational schemas. The state of the art shows that there is no "One-Size-Fits-All" RDF relational schema that fits all query workloads.
In particular, a different RDF relational schema wins by a large margin for each query type, and the winner in one query family may unexpectedly perform the worst in another. In this thesis, we argue that combining multiple RDF relational schemas into a hybrid one gives the BD system better performance when querying large KGs. Nevertheless, designing hybrid schema solutions for schemaless KGs requires huge data engineering efforts and tailored solutions. To this end, this thesis proposes algorithms that automatically design a hybrid RDF relational schema that adapts to the query workload, covering a wide range of query types, without ignoring loading times and storage overheads. In particular, we approach this goal via data profiling along with query profiling, seeking better data localization and combining relevant data that are frequently queried together in the same relations. Our approach reaches an optimal hybrid schema that considers both the underlying data relationships and the query workloads. [et]
dc.identifier.uri: https://hdl.handle.net/10062/93711
dc.language.iso: eng [et]
dc.publisher: Tartu Ülikool [et]
dc.rights: openAccess [et]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [*]
dc.subject: Large RDF Graphs [et]
dc.subject: SPARQL [et]
dc.subject: Spark-SQL [et]
dc.subject: RDF Relational Schema [et]
dc.subject: Workload Driven [et]
dc.subject.other: magistritööd (master's theses) [et]
dc.subject.other: informaatika [et]
dc.subject.other: infotehnoloogia [et]
dc.subject.other: informatics [et]
dc.subject.other: infotechnology [et]
dc.title: Query Workload-Driven Schema Optimization For Processing Large RDF Datasets [et]
dc.type: Thesis [et]
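
The abstract mentions query profiling to place data that is frequently queried together in the same relations. The sketch below is only an illustration of that general idea and is not the algorithm proposed in the thesis. It assumes a workload is given as lists of RDF predicates, one list per query, and it merges predicates that co-occur in at least `min_cooccurrence` queries into one candidate relation. The function name `group_predicates`, the threshold, and the sample predicates are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def group_predicates(workload, min_cooccurrence=2):
    """Group predicates that co-occur in at least `min_cooccurrence` queries.

    `workload` is a list of queries, each given as a list of predicate names
    (a hypothetical input format chosen for this illustration).
    """
    pair_counts = Counter()
    predicates = set()
    for query_predicates in workload:
        unique = sorted(set(query_predicates))
        predicates.update(unique)
        # Count every unordered pair of predicates used by the same query.
        pair_counts.update(combinations(unique, 2))

    # Union-find: each predicate starts in its own group.
    parent = {p: p for p in predicates}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Merge the groups of any pair that is queried together often enough.
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            parent[find(a)] = find(b)

    groups = {}
    for p in predicates:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical workload: each inner list holds the predicates of one query.
workload = [
    ["foaf:name", "foaf:mbox"],
    ["foaf:name", "foaf:mbox", "dc:title"],
    ["dc:title", "dc:creator"],
    ["dc:title", "dc:creator"],
]
groups = group_predicates(workload)
print(groups)
```

Each resulting group could be read as a candidate for a shared table, while predicates left alone would stay in their own relations. A real design would also have to weigh data profiling, loading time, and storage overhead, which this sketch ignores.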

Files

Original bundle

Name: Valiyev_MSc_CS_2023.pdf
Size: 1.3 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission