Publication

Workload-driven data partitioning for data-intensive serverless computing

Saka, Umut Mete
Abstract
Cloud computing is well suited to large-scale data analysis because of its high elasticity and lower upfront costs compared to traditional hardware infrastructure. Typically, running analytics workloads in the cloud requires detailed provisioning, which includes setting up a dedicated database instance. However, these workloads can be bursty or sporadic, leading to either underutilization or overloading of the provisioned systems. Serverless computing emerges as a solution: it improves auto-scalability by delegating server provisioning and management to the cloud provider, and it reduces costs by charging users on a pay-per-execution basis. Yet serverless functions, particularly those for data-intensive applications, face challenges such as limited memory, statelessness, and a lack of local storage. These constraints force serverless applications to transfer substantial amounts of data from remote storage to the execution environment, resulting in significant data transfer overhead. Additionally, the architectural decisions of cloud providers often complicate potential solutions. In this thesis, we introduce a workload-driven data partitioning system designed to determine efficient partitioning layouts with the goal of minimizing data transfer. The architecture records historical accesses, analyzes access patterns, and adjusts data partitions accordingly to improve data locality and avoid unnecessary data transfers. Moreover, this thesis presents a benchmark specifically designed to measure the impact of various data partitioning strategies and execution environment configurations on system performance and operational costs. Results demonstrate that workload-driven data partitioning significantly reduces the amount of data transferred during computation, thereby decreasing latency and lowering costs. Furthermore, we show that the cost and latency improvements grow with larger datasets and higher-frequency workloads.
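
The general idea of choosing a partitioning layout from recorded accesses can be illustrated with a minimal sketch. This is not the system described in the thesis: the cost model (rows fetched from every partition that might contain matches), the range-partitioning scheme, and all names (`range_partition`, `rows_transferred`, `choose_layout`) are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]
Query = Tuple[str, int, int]  # (column, low, high): inclusive range predicate


def range_partition(rows: List[Row], column: str, k: int) -> List[List[Row]]:
    """Split rows into at most k roughly equal range partitions on `column`."""
    ordered = sorted(rows, key=lambda r: r[column])
    size = -(-len(ordered) // k)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def rows_transferred(partitions: List[List[Row]], column: str, query: Query) -> int:
    """Rows fetched for one query: every partition that may hold a match."""
    q_col, low, high = query
    total = 0
    for part in partitions:
        if q_col != column:
            # The layout cannot prune on this column, so the whole partition is read.
            total += len(part)
            continue
        p_min, p_max = part[0][column], part[-1][column]
        if p_max >= low and p_min <= high:
            total += len(part)
    return total


def choose_layout(rows: List[Row], workload: List[Query],
                  columns: List[str], k: int) -> Tuple[str, Dict[str, int]]:
    """Pick the partitioning column that minimizes rows moved for the logged workload."""
    costs = {}
    for column in columns:
        parts = range_partition(rows, column, k)
        costs[column] = sum(rows_transferred(parts, column, q) for q in workload)
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical data and access log, for illustration only.
rows = [{"a": i, "b": (i * 7) % 100} for i in range(100)]
workload = [("a", 0, 9), ("a", 10, 19), ("b", 0, 49)]
best, costs = choose_layout(rows, workload, ["a", "b"], k=4)
print(best, costs)
```

In this toy workload most logged queries filter on `a`, so partitioning on `a` moves fewer rows overall than partitioning on `b`. A real system would also have to account for partition sizes in bytes, function memory limits, and the cost of repartitioning.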
Rights
Copyright of the original work is retained by the author.