Query Workload-Driven Schema Optimization For Processing Large RDF Datasets

Date

2023

Publisher

Tartu Ülikool

Abstract

Data are not only increasing in volume but also becoming more and more interconnected. In many areas of daily life, such as social media, computational biology and protein networks, and telecommunications, graph data models are the most natural, easy-to-understand, and versatile abstraction for representing the world’s structured knowledge. In fact, the information retrieved via natural language processing and computer vision is currently represented mostly by Knowledge Graphs (KGs). KGs are an efficient means to represent, integrate, and connect data from several heterogeneous data sources. These applications have led to a surge in the popularity of KGs. However, this popularity brings computational challenges because KGs are growing to massive volumes. Specifically, several applications use the standard Resource Description Framework (RDF) graph data model to represent, share, and integrate pieces of data on the web. Managing RDF KGs at scale has therefore become a central problem for the Semantic Web (SW) community. Native graph databases (e.g., Apache Jena, RDF-3X, and Virtuoso) fall short of managing and processing large RDF datasets because of their centralized computational paradigm, i.e., they cannot scale out. Thus, the SW community has started to investigate relational Big Data (BD) frameworks to harness their scalability and efficiency. Relational systems owe much of their performance to sophisticated optimizers that leverage the simplicity and maturity of the relational model and relational algebra. Despite the flexibility of relational solutions, the schemaless structure of RDF graphs makes it challenging to store and manage them in relational schemas. The state of the art shows that there is no “One-Size-Fits-All” RDF relational schema that fits all query workloads.
In particular, a different RDF relational schema wins by a large margin for each query type, and the winner for one query family may unexpectedly perform the worst for another. In this thesis, we argue that combining multiple RDF relational schemas into a hybrid one provides better performance for BD systems when querying large KGs. Nevertheless, designing hybrid schema solutions for schemaless KGs requires huge data engineering efforts and tailored solutions. To this end, this thesis proposes algorithms that automatically design a hybrid RDF relational schema that adapts to the query workload and covers a wide range of query types, without ignoring loading times and storage overheads. In particular, we approach this goal via data profiling combined with query profiling, seeking better data localization by placing relevant data that is frequently queried together in the same relations. Our approach arrives at an optimal hybrid schema that considers both the underlying data relationships and the query workloads.
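
The idea of placing data that is frequently queried together in the same relations can be illustrated with a small sketch. This is not the thesis's algorithm: the toy triples, the workload format (each query reduced to the list of predicates it touches), and the helper names `vertical_partitions`, `co_access_counts`, and `grouped_table` are all assumptions made for illustration. The sketch groups the most frequently co-accessed pair of predicates into one wide relation and leaves the remaining predicates in per-predicate tables.

```python
from collections import defaultdict
from itertools import combinations

# Toy RDF triples (subject, predicate, object); all values are made up.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("bob", "age", "25"),
]

# Hypothetical workload: each query is the list of predicates it touches.
WORKLOAD = [["name", "age"], ["name", "age"], ["knows"], ["name", "knows"]]


def vertical_partitions(triples):
    """Build one two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def co_access_counts(workload):
    """Count how often each pair of predicates appears in the same query."""
    counts = defaultdict(int)
    for predicates in workload:
        for pair in combinations(sorted(set(predicates)), 2):
            counts[pair] += 1
    return dict(counts)


def grouped_table(triples, predicates):
    """Build a wide table: one row per subject, one column per predicate.

    Simplification: a multi-valued predicate keeps only its last value.
    """
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in predicates:
            rows[s][p] = o
    return dict(rows)


# Pick the pair of predicates that the workload reads together most often.
counts = co_access_counts(WORKLOAD)
best_pair = max(counts, key=counts.get)

# Hybrid layout: the hot pair shares one relation, the rest stay partitioned.
hybrid = {
    "grouped": grouped_table(TRIPLES, set(best_pair)),
    "partitions": {
        p: rows
        for p, rows in vertical_partitions(TRIPLES).items()
        if p not in best_pair
    },
}
```

With this layout, a query that reads both grouped predicates needs no join between them, while queries over the other predicates still scan only their own small tables. A realistic design would also have to weigh loading time and storage overhead, which this sketch ignores.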

Keywords

Large RDF Graphs, SPARQL, Spark-SQL, RDF Relational Schema, Workload Driven
