hive Archives | Alluxio

Improving Presto performance with Alluxio at TikTok

June 24, 2021

Integrating Alluxio with popular query engines like Presto on existing Hive data is still not straightforward. Solutions proposed by the community, such as Alluxio Catalog Service or Transparent URI, put unnecessary pressure on Alluxio masters when queries touch files that should not be cached.

Tags: alluxio day, cache layer, hive, presto, tiktok

“Zero-Copy” Hybrid Cloud for Data Analytics – Strategy, Architecture and Benchmark Report

April 6, 2020

This whitepaper details how to leverage a public cloud, such as Amazon AWS, Google GCP, or Microsoft Azure, to scale analytic workloads directly on on-premises data without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Presto and Hive with Alluxio in the public cloud against on-prem HDFS. We will also show how to set up and execute performance benchmarks in two geographically dispersed Amazon EMR clusters, along with a summary of our findings.

Tags: aws, azure, data analytics, emr, gcp, hdfs, hive, hybrid cloud, presto, public cloud, zero copy

Everything you want to know about how to decouple SQL engines from Hive Data Warehouse

March 30, 2020 By Gene Pang

Are you using SQL engines, such as Presto, to query an existing Hive data warehouse and running into challenges such as an overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts (for example, too many small files), or a lack of influence over the existing Hive system and other Hive applications?

Optimizing Query Performance by Decoupling Presto and Hive Data Warehouse

March 24, 2020

Ideally, Presto would access data independently of how the data was originally stored or managed. Alluxio, as a data orchestration layer, provides the physical data independence that lets Presto interact with the data more efficiently. In addition to caching for IO acceleration, Alluxio also provides a catalog service to abstract the metadata in the Hive Metastore, and transformations to expose the data in a compute-optimized way. In this talk, we describe some of the challenges of using Presto with Hive, and introduce Alluxio data orchestration for solving those challenges.

Tags: alluxio engineering, catalog service, data orchestration, hive, office hour, performance, presto, structured data services

Burst Presto & Spark workloads to AWS EMR with no data copies

Community Online Office Hour * April 28, 2020

In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.

Optimizing Query Performance by Decoupling Presto and Hive Data Warehouse

Community Online Office Hour * March 24, 2020

Alluxio, as a data orchestration layer, provides the physical data independence that lets Presto interact with the data more efficiently. In addition to caching for IO acceleration, Alluxio also provides a catalog service to abstract the metadata in the Hive Metastore, and transformations to expose the data in a compute-optimized way. In this talk, we describe some of the challenges of using Presto with Hive, and introduce Alluxio data orchestration for solving those challenges.

Tags: alluxio engineering, aws s3, compute storage separation, hdfs, hive, office hour, spark

Tag: hive

Improving Presto performance with Alluxio at TikTok

“Zero-Copy” Hybrid Cloud for Data Analytics – Strategy, Architecture and Benchmark Report

Everything you want to know about how to decouple SQL engines from Hive Data Warehouse

Optimizing Query Performance by Decoupling Presto and Hive Data Warehouse

Optimizing Query Performance by Decoupling Presto and Hive Data Warehouse

Tutorial: Presto + Alluxio + Hive Metastore on Your Laptop in 10 min

Getting Started with EMR Hive on Alluxio in 10 Minutes

Community Office Hour: Accelerating Hive with Alluxio on S3