Towards Auto-Scaling of Serverless Data Pipelines
Date
2023
Authors
Publisher
Tartu Ülikool
Abstract
The ever-increasing number of IoT devices generates massive volumes of data, and collecting
data from heterogeneous sources and processing it without bottlenecks is challenging.
Data pipelines are widely used to automate data processing without manual intervention.
Traditional data pipelines, such as Extract-Load-Transform, have their own challenges:
they are difficult to scale, which reduces the timeliness of data processing. These
challenges can be addressed with serverless computing, a recent paradigm in cloud
computing that offers finer-grained scaling of functions than Virtual Machines (VMs).
With the growth of smart and Internet of Things (IoT) devices, the use of data pipelines
has increased exponentially. However, stochastic IoT workloads and the need to assure
Quality of Service metrics (latency, throughput, etc.) impose several challenges,
including scaling of the underlying infrastructure. Serverless Data Pipelines (SDPs) can
be designed to process high data volumes with efficient resource usage. SDPs comprise
several components, such as serverless functions, message queues, and queue connectors.
Scaling the entire pipeline without leaving any bottlenecks is challenging. In our study,
we created a serverless data pipeline for an image-processing IoT application that uses
serverless functions to execute the data operation tasks. We also applied different reactive
scaling mechanisms, such as resource-based scaling and workload-based scaling, to
measure the scalability of the serverless data pipeline. The reactive mechanisms
consider a single metric, e.g. CPU usage or request rate, to enforce the auto-scaling
configuration. Therefore, we evaluated the use of multiple performance metrics of the
serverless data pipeline to proactively predict the number of serverless functions in
the data pipeline. To this end, we collected data by configuring the reactive
auto-scalers, cleaned it to remove outliers, and used it to train and test
the proactive auto-scaler. In this work, we used multi-output regression models, and the
results show that the ExtraTreeRegressor algorithm is more efficient at predicting the
number of pods.
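
As an illustration of the single-metric reactive scaling the abstract refers to, the sketch below applies a proportional rule of the kind used by the Kubernetes Horizontal Pod Autoscaler. The abstract does not name the auto-scaler that was used, so the function name, parameters, and replica bounds here are assumptions for illustration only, and details such as tolerance bands and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Hypothetical single-metric reactive rule (proportional, HPA-style).

    current_metric and target_metric must use the same unit, for example
    average CPU utilisation in percent or requests per second per pod.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the replica count by the ratio of observed to target metric.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (assumed defaults, not from the thesis).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods at 90% CPU against a 50% target: scale out.
    print(desired_replicas(2, 90.0, 50.0))
    # Four pods at 20% CPU against a 50% target: scale in.
    print(desired_replicas(4, 20.0, 50.0))
```

Because the rule reacts to one metric at a time, it can only respond after that metric has already drifted from its target, which is the limitation that motivates predicting replica counts from several metrics at once.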
Keywords
Cloud Computing, Serverless Functions, Function as a Service (FaaS), Data Pipelines