Alluxio Data Orchestration Summit

video

Building a high-performance platform on AWS to support real-time gaming services using Presto, Alluxio, and S3

Serena (Teng) Wang

Software Engineer

Electronic Arts

Electronic Arts (EA) is a leading company in the gaming industry, providing over a thousand games to serve billions of users worldwide. The EA Data & AI Department builds hundreds of platforms to manage petabytes of data generated by games and users every day. These platforms consist of a wide range of data analytics, from real-time data ingestion to ETL pipelines. Formatted data produced by our department is widely adopted by executives, producers, product managers, game engineers, and designers for marketing and monetization, game design, customer engagement, player retention, and end-user experience.

Near real-time information for EA’s online services is critical for making business decisions, such as campaigns and troubleshooting. These services include, but are not limited to, real-time data visualization, dashboarding, and conversational analytics. Highly time-sensitive applications such as BI software, dashboards and AI tools heavily rely on these services. To support these use cases, we studied an innovative platform with Presto as the computing engine and Alluxio as a data orchestration layer between Presto and S3 storage. We evaluated this platform with real industrial examples of data visualization, dashboarding, and a conversational chatbot. Our preliminary results show that Presto with Alluxio outperforms S3 significantly in all cases, with a 6x performance gain when handling a large number of small files.

video

Reducing large S3 API costs using Alluxio at Datasapiens

Juraj Pohanka

CTO

Datasapiens

Koen Michiels

CEO

Datasapiens

Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.

In this talk, we will describe how we have solved an issue with large S3 API costs incurred by Presto under several usage concurrency levels by implementing Alluxio as a data orchestration layer between S3 and Presto. Also, we will show the results of an experiment with estimating the per-query S3 API costs using the TPC-DS dataset.

This talk will focus on:

The Hadoop ecosystem at Datasapiens
Drastic increase of S3 API costs during performance tests with Presto
S3 API costs tests with TPC-DS
Implications to the cloud data lake architecture

video

Building a scalable analytics environment to support diverse workloads

Tom Panozzo

Chief Technology Officer

Aunalytics

video

How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio

Katarzyna Orzechowska

Data Scientist

ING

Mariusz Derela

DevOps Engineer

ING

video

Powering interactive analytics with Alluxio and Presto

Dima Dermanskyi

Data Engineering Lead

WalkMe

video

Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio

Ke Wang

Software Engineer

Meta

For many latency-sensitive SQL workloads, Presto is often bound by retrieving distant data. In this talk, Rohit Jain from Facebook will introduce their teams’ collaboration with Alluxio on adding a local on-SSD Alluxio cache inside Presto workers at Facebook to improve queries with unsatisfied latency.

video

Exploring Alluxio for Daily Tasks at Robinhood

Jiawei Zhang

Data Platform Engineer

Robinhood

Yichuan Huang

Data Platform Engineer

Robinhood

Grace Lu

Data Platform Engineer

Robinhood

Wenlong Xiong

Data Platform Engineer

Robinhood

video

Presto: Fast SQL-on-anything across data lakes, DBMS, and NoSQL Data stores

Kamil Bajda-Pawlikowski

CTO

Starburst

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Comcast, GrubHub, FINRA, LinkedIn, Lyft, Netflix, Slack, Zalando, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.

Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in Object Storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines including Presto.

In this talk we discuss how Presto enables query-time correlations between Delta Lake, Snowflake, and Elasticsearch to drive interactive BI analytics across disparate datasets.

video

How Presto & Alluxio leverage our data-platform at Ryte

Danny Linden

Lead Software Engineer

Ryte

video

High Performance Data Lake with Apache Hudi and Alluxio at T3Go

Trevor Zhang

Big Data Sr Engineer

T3Go

Vino Yang

Head of Big Data Platform

T3GO

video

Speeding Up Spark Performance using Alluxio at China Unicom

Ce Zhang

Big Data Engineer

China Unicom

Unicom’s traditional batch architecture consists mainly of IOE, Hive, and Greenplum systems. With the development of business, a large number of computing application modules based on diverse scenarios, chimney-like, decentralized applications have emerged. To solve the problem of resource fragmentation, we have introduced a unified computing platform for computing ecology with Spark and Alluxio as the core. Alluxio plays an important role in accelerating data processing and ensuring process stability.

video

Securely Enhancing Data Access in Hybrid Cloud with Alluxio

Michael Fagan

Distinguished Architect

Comcast

Prashant Khanolkar

Big Data Architect

Comcast

video

Bursting on-premise analytic workloads to Amazon EMR using Alluxio

Roy Hassan

Analytics Specialist

AWS

Data infrastructure on-premises is increasingly complex and cloud adoption is attractive for business agility. Operating a hybrid environment is an approach to start benefiting from cloud elasticity quickly without abandoning the infrastructure on-premises. In this session I will discuss the benefits of using Alluxio’s Data Orchestration Platform to dynamically burst Apache Spark and Presto workloads to Amazon EMR for best performance and agility.

video

Hybrid Data Lake on Google Cloud with Alluxio and Dataproc

Roderick Yao

Product Manager

Google

video

Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio

Jiao (Jennie) Wang

Software Engineer

Intel

Louie Tsai

AI Software Architecture

Intel

Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked.

Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.

This talk, we will go over:

What is Analytics Zoo and how it works
How to run Analytics Zoo with Alluxio in deep learning applications
Initial performance benchmark results using the Analytics Zoo + Alluxio stack

video

Deep Learning in the Cloud at Scale: A Data Orchestration Story

Mickey Zhang

Software Engineer

Microsoft

video

Fluid: When Alluxio Meets Kubernetes

Rong Gu

Founding PMC member

Alluxio

Yang Che

Staff Engineer

Alibaba

Nowadays, cloud native environments have attracted lots of data-intensive applications deployed and ran on them, due to the efficient-to-deploy and easy-to-maintain advantages provided by cloud native platforms and frameworks such as Docker, Kubernetes. However, cloud native frameworks does not provide the data abstraction support to the applications natively. Therefore, we build Fluid project, which co-orchestrate data and containers together. We use Alluxio as the cache runtime inside Fluid to warm up hot data. In this report, we will introduce the design and effects of the Fluid project.

video

Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid

Yuandong Xie

Platform Researcher

Unisound AI Labs

Unisound focuses on Artificial Intelligence services for the Internet of Things. It is an artificial intelligence company with completely independent intellectual property rights and the world’s top intelligent voice technology. Atlas is the Deep Learning platform within Unisound AI Labs, which provides deep learning pipeline support for hundreds of algorithm scientists. This talk shares three real business training scenarios that leverage Alluxio’s distributed caching capabilities and Fluid’s cloud native capabilities, and achieve significant training acceleration and solve platform IO bottlenecks. We hope that the practice of Alluxio & Fluid on Atlas platform will bring benefits to more companies and engineers.

video

The hidden engineering behind machine learning products at Helixa

Gianmario Spacagna

Chief Scientist and Head of AI

Helixa

Data and Machine Learning (ML) technologies are now widespread and adopted by literally all industries. Although recent advancements in the field have reached an unthinkable level of maturity, many organizations still struggle with turning these advances into tangible profits. Unfortunately, many ML projects get stuck in a proof-of-concept stage without ever reaching customers and generating revenue. In order to effectively adopt ML technologies, enterprises need to build the right business cases as well as to be ready to face the inevitable technical challenges. In this talk, we will share some common pitfalls, lessons learned, and engineering practices, faced while building customer-facing enterprise ML products. In particular, we will focus on the engineering that delivers real-time audience insights everyday to thousands of marketers via the Helixa’s market research platform.

During the talk you will learn:

An overview of the Helixa ML end-to-end system
Useful engineering practices and recommended tools (PyData stack, AWS, Alluxio, scikit-learn, tensorflow, mlflow, jupyter, github, docker, Spark, to name a few..)
The R&D workflow and how it integrates with the production system
Infrastructure considerations for scalable and cheap deployment, monitoring, and alerting
How to leverage modern cloud serverless architectures for data and machine learning applications

video

Achieving Massive Concurrency and Sub-second query latency on Cloud warehouses and data lakes with kyligence cloud

George Demarest

Head of Marketing

Kyligence

Enterprises everywhere are racing to build the optimal analytics stack for creating repeatable success with predictive analytics, machine learning, and data applications. Cloud data platforms like data warehouses and data lakes are foundational elements of these software stacks and their associated data pipelines. But existing SQL query methods against these data platforms have repeatedly demonstrated disappointing performance and scaling due to poor concurrency.

In this presentation, we will discuss the use of the intelligent precomputation capabilities of Kyligence Cloud as a means of delivering on the promise of pervasive analytics at scale with massive concurrency and sub-second query latencies on large datasets in the cloud.

Kyligence, with our partner Alluxio, sits between the data platform and the processing layer. Kyligence Cloud delivers precomputed datasets for OLAP queries, BI dashboards, and machine learning applications.

video

Accelerating Data Computation on Ceph Objects using Alluxio

Leonardo Militano

Senior researcher at the Service Engineering lab

Zurich University of Applied Sciences (ZHAW)

In most of the distributed storage systems, the data nodes are decoupled from compute nodes. This is motivated by an improved cost efficiency, storage utilization and a mutually independent scalability of computation and storage. While this consideration is indisputable, several situations exist where moving computation close to the data brings important benefits. Whenever the stored data is to be processed for analytics purposes, all the data needs to be repeatedly moved from the storage to the compute cluster, which leads to reduced performance.

In this talk, we will present how using Alluxio computation and storage ecosystems can better interact benefiting of the “bringing the data close to the code” approach. Moving away from the complete disaggregation of computation and storage, data locality can enhance the computation performance. During this talk, we will present our observations and testing results that will show important enhancements in accelerating Spark Data Analytics on Ceph Objects Storage using Alluxio.

video

Unified Data Access with Gimel

Deepak Chandramouli

Engineering Lead

PayPal

Anisha Nainani

Senior Software Engineer

PayPal

Dr. Vladimir Bacvanski

Principal Architect with Strategic Architecture

PayPal

At PayPal & any other data driven enterprise – data users & applications work with a variety of data sources (RDBMS, NoSQL, Messaging, Documents, Big Data, Time Series Databases), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL) and execution models (stream, batch, interactive) to process petabytes of data. Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc. which impacts time-to-market (TTM).

To solve this problem and to make product development more effective, PayPal Data Platforms developed “Gimel”, an open source, unified analytics data platform which provides access to any storage through a single unified data API and SQL, which are powered by a centralized data catalog.

video

How to Build a new under filesystem in Alluxio: Apache Ozone as an example

Davy Wang

Chief Solutions Architect

Tencent Cloud International

Baolong Mao

Sr. System Engineer

Tencent Data Lake R&D

video

The practice of Presto & Alluxio in E-commerce big data platform

Wenjun Tao

Senior Software Engineer

JD.com

JD.com is one of the largest e-commerce corporations. In big data platform of JD.com, there are tens of thousands of nodes and tens of petabytes off-line data which require millions of spark and MapReduce jobs to process everyday. As the main query engine, thousands of machines work as Presto nodes and Presto plays an import role in the field of In-place analysis and BI tools. Meanwhile, Alluxio is deployed to improve the performance of Presto. The practice of Presto & Alluxio in JD.com benefits a lot of engineers and analysts.

video

Data Orchestration for Analytics and AI in the Cloud Era

Haoyuan Li

Founder & CEO

Alluxio

Data platforms span multiple clusters, regions and clouds to meet the business needs for agility, cost effectiveness, and efficiency. Organizations building data platforms for structured and unstructured data have standardized on separation of storage and compute to remain flexible while avoiding vendor lock-in. Data orchestration has emerged as the foundation of such a data platform for multiple use cases all the way from data ingestion to transformations to analytics and AI.

In this keynote from Haoyuan Li, founder and CEO of Alluxio, we will showcase how organizations have built data platforms based on data orchestration. The need to simplify data management and acceleration across different business personas has given rise to data orchestration as a requisite piece of the modern data platform. In addition, we will outline typical journeys for realizing a hybrid and multi-cloud strategy.

video

Alluxio Use Cases and Future Directions

Bin Fan

VP of Technology

Alluxio

Calvin Jia

In this keynote, Calvin Jia will share some of the hottest use cases in Alluxio 2 and discuss the future directions of the project being pioneered by Alluxio and the community. Bin Fan will provide an overview of the growth of Alluxio open-source community with highlights on community-driven collaboration with engineering teams from Microsoft and Alibaba to advance the technology.

video

The Pandemic Changes Everything, The need for speed and resiliency

Parviz Peiravi

Global CTO for Financial Services Industry Solutions

Intel

video

The Future of Computing is Distributed

Ion Stoica

Professor

EECS Department at UC Berkeley

Distributed applications are not new. The first distributed applications were developed over 50 years ago with the arrival of computer networks, such as ARPANET. Since then, developers have leveraged distributed systems to scale out applications and services, including large-scale simulations, web serving, and big data processing. However, until recently, distributed applications have been the exception, rather than the norm. However, this is changing quickly. There are two major trends fueling this transformation: the end of Moore’s Law and the exploding computational demands of new machine learning applications. These trends are leading to a rapidly growing gap between application demands and single-node performance which leaves us with no choice but to distribute these applications. Unfortunately, developing distributed applications is extremely hard, as it requires world-class experts. To make distributed computing easy, we have developed Ray, a framework for building and running general-purpose distributed applications.

video

Introducing the Hub for Data Orchestration

Adit Madan

Director of Product Management

Alluxio

We introduce Data Orchestration Hub, a management service that makes it easy to build an analytics or machine learning platform on data sources across regions to unify data lakes. Easy to use wizards connect compute engines, such as Presto or Spark, to data sources across data centers or from a public cloud to a private data center. In this session, you will witness the use of “The Hub” to connect a compute cluster in the cloud with data sources on-premises using Alluxio. This new service allows you to build a hybrid cloud on your own, without the expertise needed to manage or configure Alluxio.

video

Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

Sandipan Chakraborty

Director of Engineering

Rakuten

video

tf.data: TensorFlow Input Pipeline

Jiri Simsa

Software Engineer

Google

tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data, different data formats, and perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.

video

Apache Iceberg – A Table Format for Huge Analytic Datasets

Ryan Blue

Co-creator of Apache Iceberg

video

Orchestrate a Data Symphony

Haoyuan Li

Founder & CEO

Alluxio

video

Deep Learning and Gene Computing Acceleration with Alluxio in Kubernetes

Eric Li

Senior Architect

Alibaba Cloud

video

From Files to Tables: Alluxio Structured Data Management

Gene Pang

PMC Maintainer & founding member

Alluxio

video

Modern Data Platforms – Thinking Data Flywheel on the Cloud

Roy Ben-Alta

Head of WW Data & Analytics/ML and Robotics Practice

Amazon Web Services

The Data Flywheel is a comprehensive and additive approach for business and technology leaders to enable organizations to get the most value from their data. In this session, we will share common design patterns AWS customers are applying as part of their Data and AI journey. It will include real world examples. ‍ Modern Data Platforms – Thinking Data Flywheel on the Cloud from Alluxio, Inc. ‍

video

Legend, Legacy, Orchestration: Challenge and Evolution of Data Orchestration at Rakuten Data System

Lei Ai

Big Data Architect

Rakuten

video

How to Run Fast Presto Analytics with Alluxio in Cloud – a Production Experience

Danny Linden

Lead Software Engineer

Ryte

At Ryte, we analyze unstructured, semi-structured and structured data for more than one million users worldwide. The whole Ryte-Platform is built with a scalable architecture to support our heavy load and make it possible for our customers to drill-down from a high-level overview into the last byte of their websites. ‍ Presto + Alluxio on steroids a romantic drama on Production with happy end from Alluxio, Inc. ‍

video

What’s New in Alluxio 2

Calvin Jia

Bin Fan

VP of Technology

Alluxio

video

Presto: Query Anything – Data Engineer’s Perspective

Kamil Bajda-Pawlikowski

CTO

Starburst

Martin Traverso

Co-creator of Presto and Co-founder of the Presto Software Foundation

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.

This talk will discuss best use cases for Presto from the Data Engineer’s perspective. In addition, we will present the recent Presto advancements such as Cost-Based Optimizer, Kubernetes-native deployment and the project roadmap going forward.

video

How to Develop and Operate Cloud Native Data Platforms and Applications

Du Li

Architect of Data Infrastructure

Electronic Arts

Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding systems were not designed in a cloud-native way.

In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, on average we received over 1,000 alerts from thousands of machines on a daily basis, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC) emphasizing on version control and automation. MAC manages all the configurations using a GIT project in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.

In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to backup data to S3 and sync the metadata store of the ETL jobs running on HDFS and that of interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform and makes it easy to scale up and down and paves the way to a cloud native ETL processing stack.

video

Enterprise Distributed Query Service Powered by Presto & Alluxio Across Clouds at WalmartLabs

Ashish Tadose

Principal Software Engineer

WalmartLabs

This DATA ORCHESTRATION SUMMIT session talks about challenges associated with querying diverse data sources at Walmart and how those are tackled using Presto & Alluxio.

How Alluxio caching was leveraged to provide consistent optimized query performance within and across clouds.

Also highlights implementation of critical components for Enterprise acceleration offering such as security integration for fine grained access control, auto-scaling & auto deployment in GCP.

video

Open Source Panel: How to create an open source project

Ben Lorica

Chief Data Scientist

O’Reilly Media

Maxime Beauchemin

CEO and founder

Preset

Tobi Knaup

CTO and Co-Founder

Mesosphere

Haoyuan Li

Founder & CEO

Alluxio

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer