This whitepaper details how to evaluate Alluxio’s data orchestration platform as a distributed cache for Apache Spark in a public cloud or on-premises. We discuss best practices and benchmarking results with a combination of standard industry benchmarking suites, such as TPC-DS and HiBench, on cloud storage.
Tag: data analytics
In this keynote, you will learn about the evolution of the global data platform at Rakuten spread across multiple regions, and clouds. In addition, you will hear about the journey across the years, and the use of data orchestration for multiple use cases.
How T3Go’s high-performance data lake using Apache Hudi and Alluxio shortened the time for data ingestion into the lake by up to a factor of 2. Data analysts using Presto, Hudi, and Alluxio in conjunction to query data on the lake saw queries speed up by 10 times faster.
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.
This article describes how engineers in the Data Service Center at Tencent PCG leverages Alluxio to optimize the analytics performance by 200% and minimize the operating cost in building Tencent Beacon Growing, a real-time data analytics platform.
This whitepaper details how to leverage a public cloud, such as Amazon AWS, Google GCP, or Microsoft Azure to scale analytic workloads directly on data on-premises without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Presto and Hive with Alluxio in the public cloud using on-prem HDFS. We will also show how to set up and execute performance benchmarks in two geographically dispersed Amazon EMR clusters along with a summary of our findings.
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked.
This talk will overview two projects at Electronic Arts (EA) that address the mismatch by data orchestration: One project automatically generates configurations for all components in a large monitoring system, which reduces the daily average number of alerts from ~1000 to ~20. The other project introduces Alluxio for caching and unifying address space across ETL and analytics workloads, which substantially simplifies architecture, improves performance, and reduces ops overheads.