Design and Implementation of an Incremental ELT Pipeline for a Jira Data Warehouse using Data Vault 2.0 Methodology and HP Vertica
Date
2023
Authors
Bobkov, Rasmus
Publisher
Tartu Ülikool
Abstract
This master’s thesis outlines the design and implementation of a containerized ELT
pipeline for TEHIK, a company requiring an efficient way to analyze Jira Software data.
The pipeline is designed to incrementally load data into a Vertica DWH, constructed
following DV 2.0 principles. The containerized architecture enables easy deployment in
production environments. Given the breadth of the subject, the thesis aims to provide an
overview of DE, DV 2.0, Agile methodologies, and implementation. Rather than delving into
the specifics of each area, it presents a broad perspective on these fields.
The thesis begins by examining the current system, underlining its limitations, and
then introduces the proposed solution, emphasizing its advantages. The Background
Knowledge and Related Work section endeavors to provide a solid understanding of the
central concepts in DE, data warehousing, and the DV 2.0 methodology, along with deployment
in production environments. This section touches upon key topics such as ingestion, ELT
vs ETL architecture, DWH architectures, and the essence and benefits of the DV 2.0
methodology.
While the practical application of Kubernetes, logging, monitoring, and orchestration
with Airflow is not included in the thesis due to time restrictions, these aspects are still
crucial for a holistic understanding of the project. Hence, a conceptual overview of
orchestration using Airflow and a theoretical implementation for logging and monitoring
are provided.
The implementation section comprehensively explores the project’s process, unveiling
the specific steps and methodologies employed, the challenges faced, and their respective
solutions. The subsequent ’Results and Analysis’ section critically compares the proposed
solution with the existing one. It evaluates reporting capabilities, compliance with SLAs,
and the pipeline’s performance, including its scalability and its ability to handle large
data volumes.
In conclusion, this thesis delivers a robust, scalable, and efficient solution comprising
an ELT pipeline and a DV 2.0-based DWH tailored for TEHIK’s Jira Software data
analysis needs. This integrated solution outperforms the existing system, providing a
solid foundation for future enhancements and expansions.
Keywords
ELT pipeline, Data Engineering, DWH, HP Vertica, Jira, TEHIK, DV 2.0, Meltano, shell scripting, vsql, software development, Docker, scalability