Starburst Presto & Alluxio: Better Together for Presto Caching

August 20, 2018

The following is a guest post from our friends at Starburst Data.

With more companies using Presto for reporting and analytics, we here at Starburst are seeing more use cases around operational reporting. These queries need to return in under a second and usually touch only a small subset of the dataset.

Presto was designed from the ground up to offer interactive analytics using a massively parallel processing SQL engine that can combine data from multiple sources using a variety of connectors. As more and more companies discover the power of “separation of storage and compute” along with querying the data where it lies, it’s no wonder Presto is being asked to add even more functionality.

Alluxio focuses its innovation at the data layer as a key enabling technology for Presto and a wide range of analytics applications and use cases. Performance is always critical and Alluxio provides Presto caching, but providing memory speed response time is only part of the solution. If the application can’t access the data, it’s of no use. Alluxio creates a virtual data layer that aggregates data from any file or object store, providing unification across silos and allowing applications to continue using the same industry standard interfaces to access the data.

The two solutions complement each other extremely well for use cases where the same data is queried regularly, because Presto itself does not store or cache data. Coupled with Presto, the Alluxio platform provides an enterprise read/write, block-level caching engine that connects to a variety of storage systems, including S3 and HDFS.

The diagram below illustrates how Alluxio might be implemented on a public cloud such as AWS or Azure. Alluxio supports connectors for both storage types as well as HDFS:

As data is queried from S3 or Blob storage in the diagram, those blocks are cached in Alluxio. Another important, more technical detail is a recently added feature called Async Caching. It allows partial reads of data blocks in order to speed up reads. When a slower storage medium such as S3 or Blob storage is used, this greatly increases performance.
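The idea of serving a partial read immediately while the full block is cached in the background can be sketched in a few lines of Python. This is a toy model written for this post, not Alluxio's implementation: the class name `ToyAsyncBlockCache`, the 16-byte block size, and the in-memory `under_store` standing in for S3 or Blob storage are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # hypothetical, tiny block size so the example is easy to follow


class ToyAsyncBlockCache:
    """Toy read-through cache: serve the requested range now, cache the block later."""

    def __init__(self, under_store):
        self.under_store = under_store  # bytes standing in for S3 / Blob storage
        self.blocks = {}                # block index -> cached block bytes
        self.pending = {}               # block index -> in-flight background fill
        self.pool = ThreadPoolExecutor(max_workers=1)

    def _fill(self, index):
        # Background task: copy the whole block from the under store into the cache.
        start = index * BLOCK_SIZE
        self.blocks[index] = self.under_store[start:start + BLOCK_SIZE]

    def read(self, offset, length):
        """Read a byte range that lies within a single block."""
        index = offset // BLOCK_SIZE
        if index in self.blocks:
            # Cache hit: serve the range from the cached block.
            start = offset - index * BLOCK_SIZE
            return self.blocks[index][start:start + length]
        # Cache miss: schedule a background fill of the whole block, then
        # return only the requested range without waiting for the fill.
        if index not in self.pending:
            self.pending[index] = self.pool.submit(self._fill, index)
        return self.under_store[offset:offset + length]

    def wait_for_fills(self):
        # Helper for demos and tests: block until background fills finish.
        for future in list(self.pending.values()):
            future.result()
        self.pending.clear()
```

The first `read` of a range returns as soon as those bytes are fetched, and a later `read` of the same block is served from `blocks` once the background fill has completed.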

An example of where Async Caching helps is reading the footers of ORC or Parquet files. Presto does this in order to determine whether the file contains the data required by a query. If the entire block is read just to look at the footer, then this could take many seconds to minutes. With Async Caching, this now takes seconds, with the remaining data in the block read in the background without holding up the query. If Presto decides it needs to read the entire block, that data will be cached in Alluxio, speeding up the query even more.
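To see why a footer read touches so little data, recall that a Parquet file ends with the file metadata, a 4-byte little-endian metadata length, and the 4-byte magic `PAR1`. The sketch below reads only that tail. It is a simplified illustration, not Presto's reader: `read_footer` and `build_fake_file` are hypothetical helpers, the "metadata" is a placeholder byte string rather than real Thrift-encoded metadata, and ORC's different tail layout is not modeled.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with this 4-byte magic


def read_footer(f):
    """Return (footer_bytes, bytes_read) using only reads at the tail of the file."""
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)  # 4-byte footer length followed by the trailing magic
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    footer_len = struct.unpack("<I", tail[:4])[0]
    f.seek(-(8 + footer_len), io.SEEK_END)
    footer = f.read(footer_len)
    return footer, 8 + footer_len


def build_fake_file(data, footer):
    """Assemble a Parquet-shaped byte string with placeholder contents."""
    return MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC
```

For a file built with 1,000 bytes of placeholder data and a short footer, `read_footer` reads only the footer plus 8 bytes, regardless of how large the data section is.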

Alluxio also supports tiered storage. This allows data to be cached at different storage layers based on usage, so the data that is used most often is cached in the fastest tier with the lowest retrieval latency, which is of course RAM. From there, other tiers such as SSDs and regular hard drives can be used for the second and third tiers. Additionally, files can be “pinned” into the cache, which allows greater flexibility for certain use cases.
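The tiering and pinning behavior described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Alluxio's eviction logic: the class `ToyTieredCache` is hypothetical, capacities are counted in files rather than bytes, and "used more often" is approximated by recency (least recently used files are demoted first).

```python
from collections import OrderedDict


class ToyTieredCache:
    """Toy tiered cache: recently used files sit in the top tier and the
    least recently used unpinned file is demoted when a tier is full."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_files), fastest tier first
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        self.tiers = [OrderedDict() for _ in capacities]
        self.pinned = set()

    def pin(self, path):
        # Pinned files are never chosen for demotion in this toy model.
        self.pinned.add(path)

    def tier_of(self, path):
        for name, tier in zip(self.names, self.tiers):
            if path in tier:
                return name
        return None

    def _place(self, level, path, data):
        if level >= len(self.tiers):
            return  # fell off the last tier: evicted from the cache entirely
        tier = self.tiers[level]
        tier[path] = data
        if len(tier) > self.limits[level]:
            # Oldest entries come first in an OrderedDict; skip pinned ones.
            victim = next((p for p in tier if p not in self.pinned), None)
            if victim is not None:
                self._place(level + 1, victim, tier.pop(victim))

    def put(self, path, data):
        for tier in self.tiers:
            tier.pop(path, None)  # drop any stale copy before re-inserting
        self._place(0, path, data)

    def get(self, path):
        for tier in self.tiers:
            if path in tier:
                data = tier.pop(path)
                self._place(0, path, data)  # promote to the top tier on access
                return data
        return None
```

With tiers `[("MEM", 2), ("SSD", 2), ("HDD", 2)]`, inserting a third file pushes the oldest one from MEM down to SSD, reading it again promotes it back to MEM, and a pinned file stays in its tier even when it is the oldest entry.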

Amazon S3 and Azure Blob storage, along with on-premises/private cloud object stores from MinIO and Ceph, provide excellent, low-cost, fault-tolerant object storage for companies’ historical and operational data. Using Presto along with popular BI reporting tools has skyrocketed in popularity, and Alluxio gives these companies an additional tool in their belts to increase performance when using these object stores.

To get started click here for Starburst Presto and here to download Alluxio.
