sql Archives | Alluxio

ML-Based SQL Query Resource Usage Prediction

September 15, 2022

In the Big Data era, calculating the resource usage of a SQL query is usually computationally expensive. Can we estimate the resource usage of SQL queries more efficiently, without any computation in a SQL engine kernel? In this session, Chunxu and Beinan introduce how Twitter’s data platform leverages a machine learning-based approach in Presto and BigQuery to estimate query utilization with 90%+ accuracy.

Tags: alluxio day, big data, machine learning, presto, sql, twitter

Improving Presto Architectural Decisions with Alluxio Shadow Cache at Meta (Facebook)

March 30, 2022 By Ke Wang and Zhenyu Song

Through a collaboration between Meta (Facebook), Princeton University, and Alluxio, we have developed “Shadow Cache”, a lightweight Alluxio component that tracks the working set size and the infinite cache hit ratio. Shadow Cache dynamically keeps track of the working set size over a past time window and is implemented with a series of Bloom filters. It is deployed in Meta (Facebook) Presto and is being used to understand the system bottleneck and to help with routing design decisions.
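To make the idea of "a series of Bloom filters over a past window" concrete, here is a minimal Python sketch. It is not the Alluxio implementation: the class names (`BloomFilter`, `ShadowCacheSketch`), the filter sizes, the rotation scheme, and the choice to count distinct keys rather than bytes are all assumptions made for illustration. Each filter covers one time slice; rotating drops the oldest slice, and the working set size is estimated from the number of set bits in the union of all filters.

```python
import hashlib
import math
from collections import deque


class BloomFilter:
    """A small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Double hashing: derive num_hashes positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ShadowCacheSketch:
    """Hypothetical sliding-window tracker: one Bloom filter per time slice."""

    def __init__(self, num_filters=4, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.filters = deque(
            BloomFilter(num_bits, num_hashes) for _ in range(num_filters)
        )
        self.hits = 0
        self.accesses = 0

    def access(self, key):
        # A "hit" means the key was seen somewhere in the current window,
        # i.e. a cache of unlimited size would still hold it.
        self.accesses += 1
        if any(key in f for f in self.filters):
            self.hits += 1
        self.filters[-1].add(key)  # refresh the key in the newest slice

    def rotate(self):
        # Called once per time slice: forget the oldest slice, open a new one.
        self.filters.popleft()
        self.filters.append(BloomFilter(self.num_bits, self.num_hashes))

    def working_set_size(self):
        # Standard Bloom filter cardinality estimate on the union of all slices.
        merged = 0
        for f in self.filters:
            merged |= f.bits
        set_bits = bin(merged).count("1")
        if set_bits >= self.num_bits:
            return float("inf")  # filter saturated; estimate is unbounded
        return -(self.num_bits / self.num_hashes) * math.log(
            1 - set_bits / self.num_bits
        )

    def hit_ratio(self):
        # Cumulative since creation (not windowed), to keep the sketch short.
        return self.hits / self.accesses if self.accesses else 0.0
```

Because Bloom filters never store the keys themselves, memory stays fixed regardless of how much data flows through, at the cost of a small false-positive rate in the hit count.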

Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

February 2, 2022 By Rongrong Zhong

Running Presto with Alluxio is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSD or memory to cache hot datasets close to Presto workers. Presto supports hash-based soft affinity scheduling to ensure that only one or two copies of the same data are cached in the entire cluster, which improves cache efficiency by allowing more hot data to be cached locally. The hashing algorithm currently in use, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.
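The property that makes consistent hashing attractive here can be shown with a short Python sketch. This is a generic hash ring, not Presto's scheduler code: the class name `ConsistentHashRing`, the worker and split names, and the `virtual_nodes` default are hypothetical. Each worker is placed on the ring at many points, and a key is assigned to the first worker point at or after the key's hash. When a worker leaves, only the keys it owned are reassigned, so the other workers keep their cached data.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical hash ring with virtual nodes for smoother load balance."""

    def __init__(self, nodes=(), virtual_nodes=100):
        self.virtual_nodes = virtual_nodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.virtual_nodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Example: splits keep their worker unless that worker itself is removed.
ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
owner_before = ring.get_node("hdfs://warehouse/part-0001")
ring.remove_node("worker-2" if owner_before != "worker-2" else "worker-3")
owner_after = ring.get_node("hdfs://warehouse/part-0001")
# owner_after equals owner_before because the removed worker did not own the key.
```

By contrast, a simple `hash(key) % cluster_size` assignment remaps most keys whenever `cluster_size` changes, which invalidates most of the cached data at once.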

Presto: Fast SQL-on-anything across data lakes, DBMS, and NoSQL Data stores

December 13, 2020

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Comcast, GrubHub, FINRA, LinkedIn, Lyft, Netflix, Slack, and Zalando, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.

Tags: data orchestration, data orchestration summit, presto, sql, starburst

StorageQuery: federated querying on object stores, powered by Alluxio and Presto

August 25, 2020

Alluxio and Presto are a powerful combination to address the compute problem, which is part of the strategy used by Simbiose Ventures to create a product called StorageQuery, a platform for querying files in cloud storage with SQL.

Tags: cloud storage, compute storage separation, meetup, object stores, presto, shannondb, sql, storagequery, under filesystem

Everything you want to know about how to decouple SQL engines from Hive Data Warehouse

March 30, 2020 By Gene Pang

Are you using SQL engines, such as Presto, to query an existing Hive data warehouse and experiencing challenges such as an overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts (for example, too many small files), or a lack of influence over the existing Hive system and other Hive applications?

Serving Structured Data in Alluxio: Example

March 11, 2020 By Gene Pang

This article goes through a simple example to illustrate how Structured Data Management, available in the latest Alluxio 2.2.0 release, helps SQL and structured data workloads.

Serving Structured Data in Alluxio: Concept

March 11, 2020 By Gene Pang

This article introduces Structured Data Management, available in the latest Alluxio 2.2.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio.

Tag: sql