Data Orchestration Summit


November 7, 2019

COMPUTER HISTORY MUSEUM • MOUNTAIN VIEW, CA


Welcome to the first Data Orchestration Summit!

This is a one-day open source community conference focused on the key data engineering challenges and solutions around building modern data and AI platforms using the latest technologies such as Alluxio, Apache Spark, Apache Airflow, Presto, TensorFlow, and Kubernetes. This Summit brings together data engineers, cloud engineers, data scientists, and industry thought leaders who are solving data problems at the intersection of cloud, AI, and data.

FEATURED SPEAKERS

Ryan Blue

Netflix

Sr. Software Engineer

Carlos Queiroz

DBS Bank

Head of Data Platform

Du Li

Electronic Arts

Software Engineer

Haoyuan Li

Alluxio

Founder & CTO

Maxime Beauchemin

Apache Airflow and Superset

Founder

Ben Lorica

O’Reilly

Chief Data Scientist

Roy Ben-Alta

Amazon Web Services

Head of WW Data & Analytics

Ashish Tadose

WalmartLabs

Staff Engineer

Davy Wang

Tencent

GM of Tencent Cloud

Sujatha Sagiraju

Microsoft Cloud

Senior Director

Jiri Simsa

Google

Software Engineer

Kamil Bajda-Pawlikowski

Starburst, Presto Company

Co-Founder & CTO

Martin Traverso

Presto Software Foundation

Co-Founder

Bin Fan

Alluxio

Founding Member, VP Open Source

Benjamin Hindman

D2iQ (Mesosphere)

Founder

Gene Pang

Alluxio

Founding Engineer, Head of Architecture

Steven Mih

Alluxio

CEO

Lei Ai

Rakuten

Big Data Architect

Bill Zhao

Apple

Software Engineer

Danny Linden

Ryte

Chapter Lead Engineer

Rong Gu

Nanjing University

Professor of Computer Science

Calvin Jia

Alluxio

Founding Engineer, Head of Architecture

Peng Li

Alibaba

Architect of Alibaba Cloud Container Service

PROGRAM

Join us for continental breakfast and pick up some swag!

In this talk, H.Y. and Bin will discuss the rise of data engineering, the key trends impacting data engineering, and the associated challenges. More details to come!
Speakers:
Haoyuan (H.Y.) Li is the Founder and CTO of Alluxio. He graduated with a Computer Science Ph.D. from the AMPLab at UC Berkeley. At the AMPLab, he co-created and led Alluxio (formerly Tachyon), an open source virtual distributed file system. Before UC Berkeley, he received an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

Bin Fan is a founding engineer and VP of Open Source at Alluxio, and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked for Google, where he won the Technical Infrastructure Award. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, working on distributed systems.

The big data stack has evolved over the past few years with an explosion of data frameworks, starting with MapReduce and expanding to Apache Spark and Presto. The approach to managing and storing data has evolved as well, moving from primarily the Hadoop Distributed File System (HDFS) to newer, cheaper, and easier-to-use technologies like object stores.
In this talk, Carlos will dive into how DBS Bank built a modern big data analytics stack, leveraging an object store as persistent storage even for data-intensive workloads, and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. In addition, deploying Alluxio to access data solves many challenges that cloud deployments bring with separated compute and storage.
Speakers:
Carlos Queiroz is the Head of Data Platform at Development Bank of Singapore (DBS Bank), where he leads a team to drive the evolution of the data platform. Carlos received his Ph.D. in Computer Science from RMIT University.

Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding tooling systems were not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding a cloud-native monitoring system to solve this problem at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all the configurations up to date. Consequently, we received over 1,000 alerts per day on average from thousands of machines, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC), emphasizing version control and automation. MAC manages all the configurations in a Git project in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude. A big data problem is reduced to a small data problem for human productivity and operational efficiency.
Speakers:
Du Li is currently an Architect of Data Infrastructure at Electronic Arts. He worked in academia and industrial labs for many years. Prior to joining EA in mid-2018, he worked at Yahoo and Apple as a senior software engineer.

This session covers the challenges of querying diverse data sources at Walmart and how they are tackled using Presto and Alluxio. It describes how Alluxio caching was leveraged to provide consistent, optimized query performance within and across clouds. It also highlights the implementation of critical components of an enterprise acceleration offering, such as security integration for fine-grained access control, auto-scaling, and automated deployment in GCP.
Speakers:
Ashish Tadose is a Principal Software Engineer at WalmartLabs. He has extensive experience building scalable, high-performance products that leverage distributed systems in the DNS, security, ad tech, and retail domains. At WalmartLabs he is responsible for building data products to power DataLake and also contributes to the design of the overall data and cloud strategy. He is passionate about data technologies and solution architecture for large-scale data processing systems, and he likes to explore and contribute to open source technologies.

Moderator:
Steven Mih is the CEO of Alluxio. Steven has extensive industry experience; prior to Alluxio, he held leadership positions at Aviatrix, Mesosphere, and Couchbase. He is passionate about open source and cloud technologies.

Panelists:
Sujatha Sagiraju is the Senior Director of AI Platform at Microsoft Cloud. She is passionate about large scale distributed systems and democratizing, accelerating, and scaling Artificial Intelligence.

Davy Wang is the GM of Tencent Cloud Intl, responsible for Tencent Cloud's overseas business. He previously served as chief architect and solution director at AWS Greater China, where he was among the first group to receive AWS expert certification. He was also a senior technical expert at Ali Yun and a Senior Consulting Manager at IBM.

Ben Lorica is the Chief Data Scientist at O’Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity over the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.

This talk will discuss the best use cases for Presto from the data engineer’s perspective. In addition, we will present recent Presto advancements such as the Cost-Based Optimizer and Kubernetes-native deployment, along with the project roadmap going forward.

Speakers:

Kamil Bajda-Pawlikowski is a technology leader in the large scale data warehousing and analytics space. He is CTO of Starburst, the enterprise Presto company. Prior to co-founding Starburst, Kamil was the Chief Architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto. Previously, he was the co-founder and chief software architect of Hadapt, the first SQL-on-Hadoop company, acquired by Teradata in 2014.

Martin Traverso is the co-creator of Presto and co-founder of the Presto Software Foundation. Previously, he was at Facebook where he led the Presto team.

Enjoy lunch, check out the exhibitors, and mingle with other attendees and speakers!

This hands-on training, run by the creators of Presto and Alluxio, will cover how to get started with Presto and Alluxio. Attendees will get hands-on experience launching an EC2 instance, exploring the Alluxio filesystem and cluster status, and running queries with Presto on Alluxio, experiencing the performance benefits of using Alluxio in an analytics stack.

Presto is a widely popular SQL query engine, and it is great for interactive SQL analytics. However, when the data is remote or in object stores, performance becomes a challenge. Alluxio can improve Presto’s query performance by serving as a distributed cache layer co-located with Presto. Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Coupled with Alluxio, Presto’s open source distributed SQL query engine enables true separation of storage and compute with data locality, provides memory-speed response times, and aggregates data from any file or object store.

Details coming soon!
Speaker:
Roy Ben-Alta is the Head of WW Data & Analytics/ML and Robotics Practice at Amazon Web Services. Roy is an accomplished technology executive with over 15 years of experience in big data and analytics, cloud computing, business intelligence, as well as with hands-on experience in architecture, design, implementation, and operations.
Moderator:
Ben Lorica is the Chief Data Scientist at O’Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.
Panelists:
Max is the CEO and founder of Preset, the Apache Superset company. He has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Facebook, Airbnb, Lyft, and Yahoo!. A leader in the open-source community, Max is the creator of Apache Superset, a popular open-source data visualization, exploration, and analytics platform, and of Apache Airflow, an open-source tool for orchestrating complex computational workflows and data processing pipelines. More recently he founded Preset, a company devoted to building upon Superset to offer next-generation analytics as a service.
Ben Hindman is one of the creators of Apache Mesos, a platform for building and running resource-efficient distributed systems at scale. Ben started working on Mesos as a PhD student at Berkeley before he brought it to Twitter where it runs on thousands of machines. An academic at heart, his research in programming languages and distributed systems has been published in leading academic conferences.
Haoyuan (H.Y.) Li is the Founder and CTO of Alluxio. He graduated with a Computer Science Ph.D. from the AMPLab at UC Berkeley. At the AMPLab, he co-created and led Alluxio (formerly Tachyon), an open source virtual distributed file system. Before UC Berkeley, he received an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

Details coming soon!
Speakers:
Gene Pang is the PMC Maintainer of the Alluxio open source project and a founding member of Alluxio, Inc. He graduated with a Ph.D. from the AMPLab at UC Berkeley, working on distributed database systems. Before starting at Berkeley, he worked at Google and has an M.S. from Stanford University, and a B.S. from Cornell University.

tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data and different data formats, and to perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.

Speakers:

Jiri is a tech lead of the tf.data project and a software engineer at Google. He holds a PhD from Carnegie Mellon University and throughout his career he has worked on distributed systems and performance, most recently TensorFlow.

At Ryte, we analyze unstructured, semi-structured, and structured data for more than one million users worldwide. The whole Ryte platform is built on a scalable architecture to support our heavy load and to let our customers drill down from a high-level overview to the last byte of their websites.
Speakers:

Speakers:

Lei is the big data architect at Rakuten, Inc. He leads the data platform design and implementation to meet enterprise-level data analytics requirements. He is also building a team of data experts to deliver high-standard solutions. His interests broadly include cloud computing, data analytics, distributed systems, and various open source projects.

Apache Iceberg is a new format for tracking very large-scale tables, designed for object stores like S3. This talk will cover why Netflix needed to build Iceberg and the project’s high-level design, and will highlight the details that unblock better query performance.

Speakers:

Ryan Blue works on open source data projects at Netflix. He is one of the original creators of Apache Iceberg, and is a committer in the Apache Spark, Parquet, and Avro communities.

Join us for Happy Hour! Grab some bites and a drink from the open bar, mingle with the speakers and other attendees, and stay for the raffle announcement!

VENUE & DETAILS

Computer History Museum

1401 N Shoreline Blvd
Mountain View, CA 94043

Free onsite parking

SPONSORS