Self-serve Data Architecture with Presto and Alluxio Across Clouds

February 8, 2022

Adit Madan

This article highlights the synergy between two widely adopted open-source projects, Alluxio and Presto, and demonstrates how together they deliver a self-serve data architecture across clouds.

What makes an architecture self-serve?

Condition 1: Evolution of the data platform does not require changes

All data platforms evolve over time: a new data store is added, a new compute engine is introduced, or a new team needs access to shared data. In any of these cases, a data platform is self-serve if it does not require changes to accommodate the evolution.

Condition 2: Isolation across Teams

With a self-serve platform, business units don’t step on each other. When a new team is introduced, its data access should have no impact on existing usage of the shared data infrastructure.

The combination of the above two offers agility, which oftentimes is more important than the cost of physical infrastructure.

Data Platform Considerations

Below, we introduce some considerations when designing a self-serve platform, and architectural patterns for simple solutions.

Consideration 1: Data is shared

Between Compute Frameworks
- There are a large number of specialized compute engines. Each engine is better suited to a specific task, which means data needs to be shared between engines. For example, ETL runs in a batch processing engine, followed by Presto for interactive queries.
Between Different Teams
- For example, a team is responsible for collection of operational data which is then consumed by multiple other business units.
Between Data Centers Across Regions and Cloud Providers
- This offers the flexibility to choose the best service in each environment.

The solution for shared data is to have an abstraction layer across heterogeneous compute. Alluxio provides such an abstraction across clouds for seamless sharing of data between Presto and other compute engines regardless of the data store.
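As a rough illustration of what such an abstraction layer does, the sketch below models a unified namespace as a mount table that maps logical paths to underlying stores. It is a toy under stated assumptions, not Alluxio's API: the `UnifiedNamespace` class, its methods, and the bucket and host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedNamespace:
    """Toy mount table: logical path prefix -> underlying store URI.

    Hypothetical illustration only; this is not Alluxio's API.
    """
    mounts: dict = field(default_factory=dict)

    def mount(self, logical_prefix: str, store_uri: str) -> None:
        # Register an underlying store under a logical prefix.
        self.mounts[logical_prefix.rstrip("/")] = store_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Longest-prefix match, so a nested mount wins over its parent.
        candidates = [
            p for p in self.mounts
            if logical_path == p or logical_path.startswith(p + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        prefix = max(candidates, key=len)
        return self.mounts[prefix] + logical_path[len(prefix):]


# Two teams publish data in different stores (placeholder URIs).
ns = UnifiedNamespace()
ns.mount("/sales", "s3://example-bucket/sales")
ns.mount("/ops", "hdfs://example-namenode:8020/warehouse/ops")

# Any engine addresses data by logical path, regardless of the store.
sales_file = ns.resolve("/sales/2022/part-0.parquet")
ops_root = ns.resolve("/ops")
```

The point of the design is that adding or swapping a data store only changes the mount table; the logical paths that engines use stay the same.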

Consideration 2: Data has ownership domains and processing in place is simple

Although replication provides isolation, governance becomes complex as the owner of data enforces strict policies about the consumption of data.
Copies introduce redundancy, which is error-prone and has high resource requirements.

It may seem obvious that a solution is to not make copies of data, but what about performance when we don’t move data? This calls for a single abstraction layer which takes care of governance, performance and movement of data across ownership domains.

The architecture below shows Presto using the Alluxio layer for access to data regardless of the location.

The above design can be broken down into a few simple cases:

- All within a single cloud or a datacenter
- Shared across multiple datacenters or a hybrid cloud

In all these cases, the separation of the CONSUMER from the PRODUCER of data is enabled by an abstraction layer which provides more than a simple cache. Advanced preloading and write capabilities guarantee SLAs even with the separation of data from compute.
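To make the difference between a simple cache and a preloading layer concrete, here is a minimal sketch. It is a toy model, not Alluxio's implementation: the `PreloadingCache` class is hypothetical, and the remote store is stood in for by a plain dictionary.

```python
class PreloadingCache:
    """Toy read-through cache in front of a remote store.

    Hypothetical illustration only; the remote store is a plain dict.
    """

    def __init__(self, remote_store):
        self.remote_store = remote_store
        self.local = {}
        self.remote_reads = 0  # counts trips to the producer's store

    def _fetch(self, path):
        self.remote_reads += 1
        return self.remote_store[path]

    def preload(self, paths):
        # Warm the cache ahead of time, before any consumer asks.
        for path in paths:
            if path not in self.local:
                self.local[path] = self._fetch(path)

    def read(self, path):
        # Read-through: go to the remote store only on a miss.
        if path not in self.local:
            self.local[path] = self._fetch(path)
        return self.local[path]


remote = {"/ops/events/day=1": b"a", "/ops/events/day=2": b"b"}
cache = PreloadingCache(remote)

cache.preload(["/ops/events/day=1"])
reads_after_preload = cache.remote_reads

first = cache.read("/ops/events/day=1")   # preloaded: no remote read
second = cache.read("/ops/events/day=2")  # cold: one remote read
third = cache.read("/ops/events/day=2")   # warm: no remote read
```

With a plain cache, the first consumer of each path pays the remote latency. Preloading shifts that cost to a time of the platform's choosing, which is one way such a layer can help meet latency targets.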

Conclusion

With a self-serve data architecture across clouds, we construct a solution that stands the test of time as a data platform evolves. Learn more from the whitepaper Presto with Alluxio Overview – Architecture Evolution for Interactive Queries, and see how companies including Facebook, TikTok, Electronic Arts, Walmart, Tencent, and Comcast level up their Presto platforms with Alluxio.
