Design and Implementation of an Incremental ELT Pipeline for a Jira Data Warehouse using Data Vault 2.0 Methodology and HP Vertica

Date

2023

Authors

Bobkov, Rasmus

Publisher

Tartu Ülikool

Abstract

This master’s thesis describes the design and implementation of a containerized ELT pipeline for TEHIK, a company that needs an efficient way to analyze Jira Software data. The pipeline incrementally loads data into a Vertica data warehouse (DWH) built according to Data Vault 2.0 (DV 2.0) principles, and its containerized architecture makes it easy to deploy in production environments. Given the breadth of the subject, the thesis aims to give an overall understanding of data engineering (DE), DV 2.0, Agile methodologies, and implementation. Rather than treating each area in depth, it presents a broad view of these fields. The thesis begins by examining the current system and its limitations, and then introduces the proposed solution and its advantages. The Background Knowledge and Related Work section covers the central concepts of DE, data warehousing, and the DV 2.0 methodology, along with deployment in production environments. It addresses key topics such as ingestion, ELT versus ETL architecture, DWH architectures, and the essence and benefits of the DV 2.0 methodology. The practical application of Kubernetes, logging, monitoring, and orchestration with Airflow is not included in the thesis because of time restrictions, but these aspects remain crucial for a complete understanding of the project. A conceptual overview of orchestration with Airflow and a theoretical implementation of logging and monitoring are therefore provided. The implementation section describes the project’s process in detail: the steps and methodologies employed, the challenges faced, and their solutions.
The subsequent Results and Analysis section critically compares the proposed solution with the existing one. It evaluates reporting capabilities, compliance with SLAs, and the pipeline’s performance, including its scalability and its ability to handle large data volumes. In conclusion, the thesis delivers a robust, scalable, and efficient solution comprising an ELT pipeline and a DV 2.0-based DWH tailored to TEHIK’s Jira Software data analysis needs. This integrated solution outperforms the existing system and provides a solid foundation for future enhancements and expansions.

Keywords

ELT pipeline, Data Engineering, DWH, HP Vertica, Jira, TEHIK, DV 2.0, Meltano, shell scripting, vsql, software development, Docker, scalability
