Alluxio Data Orchestration Summit
video
Building a high-performance platform on AWS to support real-time gaming services using Presto, Alluxio, and S3
Serena (Teng) Wang
Software Engineer
Electronic Arts
Electronic Arts (EA) is a leading company in the gaming industry, providing over a thousand games to serve billions of users worldwide. The EA Data & AI Department builds hundreds of platforms to manage petabytes of data generated by games and users every day. These platforms cover a wide range of data analytics workloads, from real-time data ingestion to ETL pipelines. Formatted data produced by our department is widely used by executives, producers, product managers, game engineers, and designers for marketing and monetization, game design, customer engagement, player retention, and end-user experience.
Near real-time information for EA’s online services is critical for making business decisions, such as running campaigns and troubleshooting. These services include, but are not limited to, real-time data visualization, dashboarding, and conversational analytics. Highly time-sensitive applications such as BI software, dashboards, and AI tools rely heavily on these services. To support these use cases, we studied an innovative platform with Presto as the computing engine and Alluxio as a data orchestration layer between Presto and S3 storage. We evaluated this platform with real industrial examples of data visualization, dashboarding, and a conversational chatbot. Our preliminary results show that Presto with Alluxio significantly outperforms Presto directly on S3 in all cases, with a 6x performance gain when handling a large number of small files.
video
Reducing large S3 API costs using Alluxio at Datasapiens
Juraj Pohanka
CTO
Datasapiens
Koen Michiels
CEO
Datasapiens
Datasapiens is an international data-analytics startup based in Prague. We help our clients to uncover the value of their data and open up new revenue streams for them. We provide an end-to-end service that manages the data pipeline and automates the process of generating data insights.
In this talk, we will describe how we solved an issue with large S3 API costs incurred by Presto at several levels of query concurrency by implementing Alluxio as a data orchestration layer between S3 and Presto. We will also show the results of an experiment that estimates per-query S3 API costs using the TPC-DS dataset.
This talk will focus on:
- The Hadoop ecosystem at Datasapiens
- The drastic increase in S3 API costs during performance tests with Presto
- S3 API cost tests with TPC-DS
- Implications for the cloud data lake architecture
video
Building a scalable analytics environment to support diverse workloads
Tom Panozzo
Chief Technology Officer
Aunalytics
video
How to teach your data scientist to leverage an analytics cluster with Presto, Spark, and Alluxio
Katarzyna Orzechowska
Data Scientist
ING
Mariusz Derela
DevOps Engineer
ING
video
Powering interactive analytics with Alluxio and Presto
Dima Dermanskyi
Data Engineering Lead
WalkMe
video
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio
Ke Wang
Software Engineer
Meta
For many latency-sensitive SQL workloads, Presto is often bottlenecked by retrieving distant data. In this talk, Rohit Jain from Facebook will introduce their teams’ collaboration with Alluxio on adding a local on-SSD Alluxio cache inside Presto workers at Facebook to speed up queries whose latency is unsatisfactory.
video
Exploring Alluxio for Daily Tasks at Robinhood
Jiawei Zhang
Data Platform Engineer
Robinhood
Yichuan Huang
Data Platform Engineer
Robinhood
Grace Lu
Data Platform Engineer
Robinhood
Wenlong Xiong
Data Platform Engineer
Robinhood
video
Presto: Fast SQL-on-anything across data lakes, DBMS, and NoSQL Data stores
Kamil Bajda-Pawlikowski
CTO
Starburst
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Comcast, GrubHub, FINRA, LinkedIn, Lyft, Netflix, Slack, and Zalando, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in Object Storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines including Presto.
In this talk we discuss how Presto enables query-time correlations between Delta Lake, Snowflake, and Elasticsearch to drive interactive BI analytics across disparate datasets.
video
How Presto & Alluxio leverage our data-platform at Ryte
Danny Linden
Lead Software Engineer
Ryte
Presto & Alluxio on AWS: how we build an up-to-date data platform at Ryte.
video
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang
Big Data Sr Engineer
T3Go
Vino Yang
Head of Big Data Platform
T3Go
This talk introduces T3Go’s solution for building an enterprise-level data lake based on Apache Hudi & Alluxio, and shows how to use Alluxio to accelerate reading and writing data on the data lake when compute and storage are separated.
video
Speeding Up Spark Performance using Alluxio at China Unicom
Ce Zhang
Big Data Engineer
China Unicom
Unicom’s traditional batch architecture consists mainly of IOE, Hive, and Greenplum systems. As the business has grown, a large number of scenario-specific computing modules and siloed ("chimney-like"), decentralized applications have emerged. To solve the resulting resource fragmentation, we have introduced a unified computing platform with Spark and Alluxio at its core. Alluxio plays an important role in accelerating data processing and ensuring process stability.
video
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Michael Fagan
Distinguished Architect
Comcast
Prashant Khanolkar
Big Data Architect
Comcast
This talk describes the benefits of Alluxio and the methods by which it enables secure data access in Comcast’s dx hybrid data cloud.
- Review the data access challenges and tradeoffs in hybrid cloud
- Review our hybrid architecture and the important role Alluxio plays
- Provide performance metrics to highlight the benefits
video
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Roy Hassan
Analytics Specialist
AWS
Data infrastructure on-premises is increasingly complex and cloud adoption is attractive for business agility. Operating a hybrid environment is an approach to start benefiting from cloud elasticity quickly without abandoning the infrastructure on-premises. In this session I will discuss the benefits of using Alluxio’s Data Orchestration Platform to dynamically burst Apache Spark and Presto workloads to Amazon EMR for best performance and agility.
video
Hybrid Data Lake on Google Cloud with Alluxio and Dataproc
Roderick Yao
Product Manager
Google
Dataproc is Google’s managed Hadoop and Spark platform. In this talk, we will showcase how to swiftly build a hybrid cloud data platform with Alluxio and Presto and migrate data seamlessly.
video
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Jiao (Jennie) Wang
Software Engineer
Intel
Louie Tsai
AI Software Architecture
Intel
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What Analytics Zoo is and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
video
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Mickey Zhang
Software Engineer
Microsoft
video
Fluid: When Alluxio Meets Kubernetes
Rong Gu
Founding PMC member
Alluxio
Yang Che
Staff Engineer
Alibaba
Cloud native environments now attract many data-intensive applications, thanks to the ease of deployment and maintenance offered by cloud native platforms and frameworks such as Docker and Kubernetes. However, cloud native frameworks do not natively provide data abstraction support for applications. We therefore built the Fluid project, which co-orchestrates data and containers. We use Alluxio as the cache runtime inside Fluid to warm up hot data. In this talk, we will introduce the design and effects of the Fluid project.
video
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Yuandong Xie
Platform Researcher
Unisound AI Labs
Unisound focuses on artificial intelligence services for the Internet of Things. It is an artificial intelligence company with completely independent intellectual property rights and the world’s top intelligent voice technology. Atlas is the deep learning platform within Unisound AI Labs, which provides deep learning pipeline support for hundreds of algorithm scientists. This talk shares three real business training scenarios that leverage Alluxio’s distributed caching capabilities and Fluid’s cloud native capabilities to achieve significant training acceleration and resolve platform I/O bottlenecks. We hope that the practice of Alluxio & Fluid on the Atlas platform will bring benefits to more companies and engineers.
video
The hidden engineering behind machine learning products at Helixa
Gianmario Spacagna
Chief Scientist and Head of AI
Helixa
Data and Machine Learning (ML) technologies are now widespread and adopted by nearly all industries. Although recent advancements in the field have reached a remarkable level of maturity, many organizations still struggle to turn these advances into tangible profits. Unfortunately, many ML projects get stuck in a proof-of-concept stage without ever reaching customers and generating revenue. To adopt ML technologies effectively, enterprises need to build the right business cases and be ready to face the inevitable technical challenges. In this talk, we will share some common pitfalls, lessons learned, and engineering practices from building customer-facing enterprise ML products. In particular, we will focus on the engineering that delivers real-time audience insights every day to thousands of marketers via Helixa’s market research platform.
During the talk you will learn:
- An overview of the Helixa ML end-to-end system
- Useful engineering practices and recommended tools (PyData stack, AWS, Alluxio, scikit-learn, TensorFlow, MLflow, Jupyter, GitHub, Docker, and Spark, to name a few)
- The R&D workflow and how it integrates with the production system
- Infrastructure considerations for scalable and cheap deployment, monitoring, and alerting
- How to leverage modern cloud serverless architectures for data and machine learning applications
video
Achieving Massive Concurrency and Sub-second Query Latency on Cloud Warehouses and Data Lakes with Kyligence Cloud
George Demarest
Head of Marketing
Kyligence
Enterprises everywhere are racing to build the optimal analytics stack for creating repeatable success with predictive analytics, machine learning, and data applications. Cloud data platforms like data warehouses and data lakes are foundational elements of these software stacks and their associated data pipelines. But existing SQL query methods against these data platforms have repeatedly demonstrated disappointing performance and scaling due to poor concurrency.
In this presentation, we will discuss the use of the intelligent precomputation capabilities of Kyligence Cloud as a means of delivering on the promise of pervasive analytics at scale with massive concurrency and sub-second query latencies on large datasets in the cloud.
Kyligence, with our partner Alluxio, sits between the data platform and the processing layer. Kyligence Cloud delivers precomputed datasets for OLAP queries, BI dashboards, and machine learning applications.
video
Accelerating Data Computation on Ceph Objects using Alluxio
Leonardo Militano
Senior researcher at the Service Engineering lab
Zurich University of Applied Sciences (ZHAW)
In most distributed storage systems, the data nodes are decoupled from the compute nodes. This is motivated by improved cost efficiency and storage utilization, and by the ability to scale computation and storage independently. While these advantages are indisputable, there are several situations where moving computation close to the data brings important benefits. Whenever the stored data is to be processed for analytics purposes, all the data needs to be repeatedly moved from the storage to the compute cluster, which leads to reduced performance.
In this talk, we will present how, using Alluxio, computation and storage ecosystems can interact better by benefiting from the “bringing the data close to the code” approach. By moving away from the complete disaggregation of computation and storage, data locality can enhance computation performance. We will present our observations and test results, which show important improvements in accelerating Spark data analytics on Ceph object storage using Alluxio.
video
Unified Data Access with Gimel
Deepak Chandramouli
Engineering Lead
PayPal
Anisha Nainani
Senior Software Engineer
PayPal
Dr. Vladimir Bacvanski
Principal Architect with Strategic Architecture
PayPal
At PayPal, as at any other data-driven enterprise, data users and applications work with a variety of data sources (RDBMS, NoSQL, Messaging, Documents, Big Data, Time Series Databases), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL), and execution models (stream, batch, interactive) to process petabytes of data. Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, and so on, which impacts time-to-market (TTM).
To solve this problem and to make product development more effective, PayPal Data Platforms developed “Gimel”, an open source, unified analytics data platform which provides access to any storage through a single unified data API and SQL, which are powered by a centralized data catalog.
video
How to Build a new under filesystem in Alluxio: Apache Ozone as an example
Davy Wang
Chief Solutions Architect
Tencent Cloud International
Baolong Mao
Sr. System Engineer
Tencent Data Lake R&D
In this talk, Baolong Mao from Tencent will share his experience in developing the Apache Ozone under file system, showing how to create a new under file system in a few steps with minimal lines of code.
video
The practice of Presto & Alluxio in E-commerce big data platform
Wenjun Tao
Senior Software Engineer
JD.com
JD.com is one of the largest e-commerce corporations. The JD.com big data platform has tens of thousands of nodes and tens of petabytes of offline data, which require millions of Spark and MapReduce jobs to process every day. As the main query engine, Presto runs on thousands of machines and plays an important role in in-place analysis and BI tools. Meanwhile, Alluxio is deployed to improve the performance of Presto. The practice of Presto & Alluxio at JD.com benefits many engineers and analysts.
video
Data Orchestration for Analytics and AI in the Cloud Era
Haoyuan Li
Founder & CEO
Alluxio
Data platforms span multiple clusters, regions and clouds to meet the business needs for agility, cost effectiveness, and efficiency. Organizations building data platforms for structured and unstructured data have standardized on separation of storage and compute to remain flexible while avoiding vendor lock-in. Data orchestration has emerged as the foundation of such a data platform for multiple use cases all the way from data ingestion to transformations to analytics and AI.
In this keynote from Haoyuan Li, founder and CEO of Alluxio, we will showcase how organizations have built data platforms based on data orchestration. The need to simplify data management and acceleration across different business personas has given rise to data orchestration as a requisite piece of the modern data platform. In addition, we will outline typical journeys for realizing a hybrid and multi-cloud strategy.
video
Alluxio Use Cases and Future Directions
Bin Fan
VP of Technology
Alluxio
Calvin Jia
In this keynote, Calvin Jia will share some of the hottest use cases in Alluxio 2 and discuss the future directions of the project being pioneered by Alluxio and the community. Bin Fan will provide an overview of the growth of the Alluxio open-source community, with highlights on community-driven collaboration with engineering teams from Microsoft and Alibaba to advance the technology.
video
The Pandemic Changes Everything, The need for speed and resiliency
Parviz Peiravi
Global CTO for Financial Services Industry Solutions
Intel
video
The Future of Computing is Distributed
Ion Stoica
Professor
EECS Department at UC Berkeley
Distributed applications are not new. The first distributed applications were developed over 50 years ago with the arrival of computer networks, such as ARPANET. Since then, developers have leveraged distributed systems to scale out applications and services, including large-scale simulations, web serving, and big data processing. Until recently, however, distributed applications have been the exception rather than the norm. This is changing quickly. There are two major trends fueling this transformation: the end of Moore’s Law and the exploding computational demands of new machine learning applications. These trends are leading to a rapidly growing gap between application demands and single-node performance, which leaves us with no choice but to distribute these applications. Unfortunately, developing distributed applications is extremely hard, as it requires world-class experts. To make distributed computing easy, we have developed Ray, a framework for building and running general-purpose distributed applications.
video
Introducing the Hub for Data Orchestration
Adit Madan
Director of Product Management
Alluxio
We introduce Data Orchestration Hub, a management service that makes it easy to build an analytics or machine learning platform on data sources across regions to unify data lakes. Easy-to-use wizards connect compute engines, such as Presto or Spark, to data sources across data centers or from a public cloud to a private data center. In this session, you will see “The Hub” used to connect a compute cluster in the cloud with data sources on-premises using Alluxio. This new service allows you to build a hybrid cloud on your own, without the expertise needed to manage or configure Alluxio.
video
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty
Director of Engineering
Rakuten
In this keynote, you will learn about the evolution of the global data platform at Rakuten, which is spread across multiple regions and clouds. In addition, you will hear about the journey across the years and the use of data orchestration for multiple use cases.
video
tf.data: TensorFlow Input Pipeline
Jiri Simsa
Software Engineer
Google
tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data, different data formats, and perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.
video
Apache Iceberg – A Table Format for Huge Analytic Datasets
Ryan Blue
Co-creator of Apache Iceberg
Apache Iceberg is a new table format for tracking very large-scale tables, designed for object stores like S3. This talk will cover why Netflix needed to build Iceberg and the project’s high-level design, and will highlight the details that unblock better query performance.
video
Orchestrate a Data Symphony
Haoyuan Li
Founder & CEO
Alluxio
In this keynote, Haoyuan will discuss the key challenges and trends impacting data engineering, and explore the concept of Data Orchestration.
video
Deep Learning and Gene Computing Acceleration with Alluxio in Kubernetes
Eric Li
Senior Architect
Alibaba Cloud
video
From Files to Tables: Alluxio Structured Data Management
Gene Pang
PMC Maintainer & founding member
Alluxio
video
Modern Data Platforms – Thinking Data Flywheel on the Cloud
Roy Ben-Alta
Head of WW Data & Analytics/ML and Robotics Practice
Amazon Web Services
The Data Flywheel is a comprehensive and additive approach for business and technology leaders to enable organizations to get the most value from their data. In this session, we will share common design patterns AWS customers are applying as part of their Data and AI journey. It will include real world examples.
video
Legend, Legacy, Orchestration: Challenge and Evolution of Data Orchestration at Rakuten Data System
Lei Ai
Big Data Architect
Rakuten
video
How to Run Fast Presto Analytics with Alluxio in Cloud – a Production Experience
Danny Linden
Lead Software Engineer
Ryte
At Ryte, we analyze unstructured, semi-structured, and structured data for more than one million users worldwide. The whole Ryte platform is built on a scalable architecture to support our heavy load and to let our customers drill down from a high-level overview into the last byte of their websites.
video
What’s New in Alluxio 2
Calvin Jia
Bin Fan
VP of Technology
Alluxio
Alluxio core maintainers and founding engineers share the latest innovations in Alluxio 2.
video
Presto: Query Anything – Data Engineer’s Perspective
Kamil Bajda-Pawlikowski
CTO
Starburst
Martin Traverso
Co-creator of Presto and Co-founder of the Presto Software Foundation
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
This talk will discuss the best use cases for Presto from the data engineer’s perspective. In addition, we will present recent Presto advancements such as the Cost-Based Optimizer and Kubernetes-native deployment, as well as the project roadmap going forward.
video
How to Develop and Operate Cloud Native Data Platforms and Applications
Du Li
Architect of Data Infrastructure
Electronic Arts
Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding systems were not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, we received on average over 1,000 alerts from thousands of machines every day, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC), which emphasizes version control and automation. MAC manages all the configurations in a Git project, in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.
In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to back up data to S3 and to sync the metadata store of the ETL jobs running on HDFS with that of the interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform, makes it easy to scale up and down, and paves the way to a cloud native ETL processing stack.
video
Enterprise Distributed Query Service Powered by Presto & Alluxio Across Clouds at WalmartLabs
Ashish Tadose
Principal Software Engineer
WalmartLabs
This Data Orchestration Summit session discusses the challenges associated with querying diverse data sources at Walmart and how they are tackled using Presto & Alluxio.
It covers how Alluxio caching was leveraged to provide consistent, optimized query performance within and across clouds.
It also highlights the implementation of critical components of the enterprise acceleration offering, such as security integration for fine-grained access control, auto-scaling, and automated deployment in GCP.
video
Open Source Panel: How to create an open source project
Ben Lorica
Chief Data Scientist
O’Reilly Media
Maxime Beauchemin
CEO and founder
Preset
Tobi Knaup
CTO and Co-Founder
Mesosphere
Haoyuan Li
Founder & CEO
Alluxio
In this panel, creators of open source projects share their stories, from why they started their projects to the challenges they encountered along the way.