Query Workload-Driven Schema Optimization For Processing Large RDF Datasets
Date
2023
Publisher
Tartu Ülikool
Abstract
In the world we live in, data are not only increasing in volume, but they are also becoming
more and more interconnected and linked. In many areas of our daily lives, such as
social media, computational biology and protein networks, telecommunications, and
many others, graph data models are the most natural, easy-to-understand, and versatile
data abstraction to represent the world’s structured knowledge. In fact, the information
retrieved via natural language processing and computer vision is currently being represented
mostly by Knowledge Graphs (KGs). KGs are an efficient means to represent,
integrate, and connect data from several heterogeneous data sources. These applications
have led to a surge in the popularity of KGs. However, this popularity brings computational
challenges because KGs are growing to massive volumes. Specifically, several
applications have used the standard Resource Description Framework (RDF) graph data
model to represent, share, and integrate pieces of data on the web.
Therefore, managing RDF KGs at scale has become a central problem for the
Semantic Web (SW) community. Native graph databases (e.g., Apache Jena,
RDF-3X, and Virtuoso) fall short of managing and processing large RDF datasets due
to their centralized computational paradigm, i.e., they cannot scale out. Thus, the SW
community has started to investigate relational Big Data (BD) frameworks to harness
their scalability and efficiency. Relational systems owe much of their performance
to sophisticated optimizers that leverage the simplicity and maturity of the relational
model and relational algebra. Despite the flexibility of relational solutions, the schema-less
structure of RDF graphs makes it challenging to store and manage them in
relational schemas. The state of the art shows that there is no “One-Size-Fits-All” RDF
relational schema that fits all query workloads. In particular, a different
RDF relational schema wins by a large margin for each query type, and the winner
in one query family may unexpectedly perform the worst in another.
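To make the notion of an RDF relational schema concrete, the following minimal sketch uses plain Python (not Spark-SQL) to lay out the same triples in two ways commonly discussed in the literature: a single triple table and vertical partitioning by predicate. The sample triples, predicate names, and function names are illustrative assumptions and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "acme"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "acme"),
]


def triple_table(triples):
    """Single three-column table: every triple is one row."""
    return list(triples)


def vertical_partitioning(triples):
    """One two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def employers_of_friends_tt(tt):
    """Pattern '?x knows ?y . ?y worksAt ?z' as a self-join on the triple table."""
    return sorted(
        (s1, o2)
        for s1, p1, o1 in tt if p1 == "knows"
        for s2, p2, o2 in tt if p2 == "worksAt" and s2 == o1
    )


def employers_of_friends_vp(vp):
    """Same pattern as a join between two predicate tables."""
    return sorted(
        (x, z)
        for x, y in vp.get("knows", [])
        for y2, z in vp.get("worksAt", [])
        if y2 == y
    )
```

Both layouts return the same answer for this pattern, but the triple table scans every row twice, while the partitioned layout touches only the two relevant predicate tables. Which layout is cheaper depends on the query shape, which is the point made above.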
In this thesis, we argue that combining multiple RDF relational schemas into a
hybrid one provides better performance for BD systems when querying large KGs.
Nevertheless, designing hybrid schema solutions for schema-less KGs requires huge data
engineering efforts and tailored solutions. To this end, this thesis proposes algorithms that
automatically design a hybrid RDF relational schema that adapts to a query workload
covering a wide range of query types, without ignoring loading times or
storage overheads. In particular, we approach this goal via data profiling along
with query profiling, seeking better data localization and placing relevant data that
are frequently queried together in the same relations. Our approach arrives at an optimal
hybrid schema that considers both the underlying data relationships and the query workloads.
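As a rough illustration of what query profiling could look like, the sketch below counts how often pairs of predicates appear together in the queries of a workload and groups predicates whose co-occurrence count reaches a threshold. This is a hypothetical simplification for intuition only, not the algorithm proposed in the thesis; the workload, the `min_support` parameter, and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical workload: each query is reduced to the set of predicates it uses.
WORKLOAD = [
    {"name", "email"},
    {"name", "email", "knows"},
    {"knows", "worksAt"},
    {"name", "email"},
]


def co_occurrence(workload):
    """Count how many queries use each unordered pair of predicates."""
    counts = Counter()
    for preds in workload:
        for pair in combinations(sorted(preds), 2):
            counts[pair] += 1
    return counts


def group_predicates(workload, min_support=2):
    """Merge predicates whose pair count is at least min_support (union-find)."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    # Register every predicate so that unpaired ones form their own group.
    for preds in workload:
        for p in preds:
            find(p)

    for (a, b), n in co_occurrence(workload).items():
        if n >= min_support:
            parent[find(a)] = find(b)

    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())
```

In this toy workload, `name` and `email` are requested together in three queries, so they would be candidates for storage in the same relation, while `knows` and `worksAt` stay separate. A real design would also need to weigh data characteristics, loading time, and storage overhead, as the abstract notes.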
Keywords
Large RDF Graphs, SPARQL, Spark-SQL, RDF Relational Schema, Workload Driven