FEATURED SPEAKERS

Ryan Blue

Netflix

Sr. Software Engineer

Carlos Queiroz

DBS Bank

Head of Data Platform

Du Li

Electronic Arts

Software Engineer

Haoyuan Li

Alluxio

Founder & CTO

Maxime Beauchemin

Apache Airflow and Superset

Founder

Ben Lorica

O’Reilly

Chief Data Scientist

Roy Ben-Alta

Amazon Web Services

Head of WW Data & Analytics

Ashish Tadose

WalmartLabs

Staff Engineer

Davy Wang

Tencent

GM of Tencent Cloud

Swati Gharse

Microsoft Cloud

Principal Product Manager in Azure Machine Learning

Jiri Simsa

Google

Software Engineer

Kamil Bajda-Pawlikowski

Starburst, Presto Company

Co-Founder & CTO

Amelia Wong

Alluxio

Co-Founder & Conference Chair

Martin Traverso

Presto Software Foundation

Co-Founder

Bin Fan

Alluxio

Founding Member, VP Open Source

Tobi Knaup

D2iQ (Mesosphere)

CTO & Co-Founder

Gene Pang

Alluxio

Founding Engineer, Head of Architecture

Steven Mih

Alluxio

CEO

Lei Ai

Rakuten

Big Data Architect

Bill Zhao

Apple

Software Engineer

Danny Linden

Ryte

Chapter Lead Engineer

Rong Gu

Nanjing University

Professor of Computer Science

Calvin Jia

Alluxio

Founding Engineer, Head of Architecture

Eric (Peng) Li

Alibaba

Architect of AlibabaCloud Container Service

PROGRAM

Join us for continental breakfast and pick up some swag!

Summit hosts would like to greet you all!

In this talk, HY will discuss the key challenges and trends impacting data engineering, and explore the concept of Data Orchestration. More details to come!
Speakers:
Haoyuan (H.Y.) Li is the Founder, CTO, and Chairman of Alluxio. He graduated with a Computer Science Ph.D. from the AMPLab at UC Berkeley. At the AMPLab, he co-created and led Alluxio (formerly Tachyon), an open source virtual distributed file system. Before UC Berkeley, he earned an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

The big data stack has evolved over the past few years with an explosion of data frameworks, starting with MapReduce and expanding to Apache Spark and Presto. The approach to managing and storing data has evolved as well, starting from using primarily Hadoop distributed file system (HDFS) to newer, cheaper, and easier technologies like object stores.
In this talk, Carlos will dive into how DBS Bank built a modern big data analytics stack, leveraging an object store as persistent storage even for data-intensive workloads, and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. In addition, deploying Alluxio to access data solves many challenges that cloud deployments bring with separated compute and storage.
Speakers:
Carlos Queiroz is the Head of Data Platform at Development Bank of Singapore (DBS Bank), where he leads a team to drive the evolution of the data platform. Carlos received his Ph.D. in Computer Science from RMIT University.

This session covers the challenges of querying diverse data sources at Walmart and how they are tackled using Presto and Alluxio.
It shows how Alluxio caching was leveraged to provide consistent, optimized query performance within and across clouds.
It also highlights the implementation of critical components of the enterprise acceleration offering, such as security integration for fine-grained access control, auto-scaling, and automated deployment in GCP.
Speakers:
Ashish Tadose is a Principal Software Engineer at WalmartLabs.
He has extensive experience building scalable, high-performance products on distributed systems in the DNS, security, ad tech, and retail domains. At WalmartLabs he is responsible for building data products that power the DataLake, and he also contributes to the design of the overall data and cloud strategy.
He is passionate about data technologies and solution architecture for large-scale data processing systems, and likes to explore and contribute to open source technologies.

Check out the exhibitors, drop in on office hours and ask an expert, or mingle!

Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding systems were not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, we received over 1,000 alerts per day on average from thousands of machines, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC), emphasizing version control and automation. MAC manages all the configurations in a Git project, in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.
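The idea of generating monitoring configurations from a shared standard, rather than writing them by hand, can be sketched roughly as follows. This is only an illustration and not EA's actual MAC implementation: the `Service` fields, the `DEFAULT_CHECKS` standard, and the JSON output format are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One entry in a version-controlled service inventory (hypothetical schema)."""
    name: str
    port: int
    owner: str


# Hypothetical organization-wide standard: every service gets these checks.
DEFAULT_CHECKS = (
    {"check": "process_up", "severity": "critical"},
    {"check": "port_open", "severity": "critical"},
    {"check": "disk_usage", "threshold_pct": 90, "severity": "warning"},
)


def render_config(service: Service) -> dict:
    """Build the monitoring config for one service from the shared standard."""
    checks = []
    for template in DEFAULT_CHECKS:
        check = dict(template)  # copy so the shared standard is never mutated
        if check["check"] == "port_open":
            check["port"] = service.port  # fill in the service-specific value
        checks.append(check)
    return {"service": service.name, "owner": service.owner, "checks": checks}


def render_all(inventory) -> dict:
    """Render every config deterministically, so regenerated output diffs cleanly in Git."""
    ordered = sorted(inventory, key=lambda s: s.name)
    return {
        s.name: json.dumps(render_config(s), indent=2, sort_keys=True)
        for s in ordered
    }


if __name__ == "__main__":
    inventory = [Service("etl-worker", 8080, "data-platform")]
    for name, text in render_all(inventory).items():
        print(f"# {name}.json")
        print(text)
```

Because the output depends only on the inventory and the standard, adding or removing a service becomes a one-line change that can be reviewed and deployed like any other code change.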
In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to back up data to S3 and to sync the metadata store of the ETL jobs running on HDFS with that of the interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform, makes it easy to scale up and down, and paves the way to a cloud-native ETL processing stack.
Speakers:
Du Li is currently an Architect of Data Infrastructure at Electronic Arts. He worked in academia and industrial labs for many years after earning his PhD degree from UCLA. Prior to joining EA in mid-2018, he was a software engineer at Yahoo and Apple.

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity over the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.

This talk will discuss best use cases for Presto from the Data Engineer’s perspective. In addition, we will present the recent Presto advancements such as Cost-Based Optimizer, Kubernetes-native deployment and the project roadmap going forward.

Speakers:

Kamil Bajda-Pawlikowski is a technology leader in the large scale data warehousing and analytics space. He is CTO of Starburst, the enterprise Presto company. Prior to co-founding Starburst, Kamil was the Chief Architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto. Previously, he was the co-founder and chief software architect of Hadapt, the first SQL-on-Hadoop company, acquired by Teradata in 2014.

Martin Traverso is the co-creator of Presto and co-founder of the Presto Software Foundation. Previously, he was at Facebook where he led the Presto team.

Enjoy lunch, check out the exhibitors, and mingle with other attendees and speakers!

This hands-on training, run by the creators of Presto and Alluxio, will cover how to get started with Presto and Alluxio. Attendees will get hands-on experience launching an EC2 instance, exploring the Alluxio filesystem and cluster status, and running queries with Presto on Alluxio to see the performance benefits of using Alluxio in an analytics stack.
Presto is a widely popular SQL query engine, and it is great for interactive SQL analytics. However, when the data is remote or in object stores, performance becomes a challenge. Alluxio can improve Presto’s query performance by serving as a distributed cache layer co-located with Presto. Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Presto’s open source distributed SQL query engine coupled with Alluxio enables true separation of storage and compute while preserving data locality, provides memory-speed response times, and aggregates data from any file or object store.
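The benefit of a cache layer in front of remote storage can be illustrated with a toy read-through cache. This is a single-process sketch of the general idea only; it does not use or model the Alluxio or Presto APIs, and the `ReadThroughCache` class and its `fetch` callback are hypothetical names.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with least-recently-used eviction.

    `fetch` stands in for a slow read from remote or object storage.
    """

    def __init__(self, fetch, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch
        self._capacity = capacity
        self._entries = OrderedDict()  # path -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._entries:
            # Cache hit: serve locally and mark the entry as recently used.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cache miss: go to the remote store once, then keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return data


if __name__ == "__main__":
    def slow_remote_read(path):
        print(f"remote read: {path}")
        return path.encode()

    cache = ReadThroughCache(slow_remote_read, capacity=2)
    cache.read("warehouse/part-0")  # goes to the remote store
    cache.read("warehouse/part-0")  # served from the cache
    print(cache.hits, cache.misses)
```

Repeated reads of the same data, which are common in interactive analytics, reach the remote store only once while the entry stays cached.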

Alluxio core maintainers and founding engineers will share the latest innovations in Alluxio 2. More details coming!
Speakers:

Calvin Jia is the top contributor to the Alluxio project. He has been involved as a core maintainer and release manager since the early days when the project was known as Tachyon. Calvin has a B.S. from UC Berkeley.

Bin Fan is a founding engineer and VP of Open Source at Alluxio, and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked at Google, where he won the Technical Infrastructure Award. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, working on distributed systems.

At Ryte, we analyze unstructured, semi-structured and structured data for more than one million users worldwide. The whole Ryte-Platform is built with a scalable architecture to support our heavy load and make it possible for our customers to drill-down from a high-level overview into the last byte of their websites.
Speakers:

Moderator:
Steven Mih is the CEO of Alluxio. Steven has extensive industry experience; prior to Alluxio, he held leadership positions at Aviatrix, Mesosphere, and Couchbase. He is passionate about open source and cloud technologies.

Panelists:
Swati Gharse is a Principal Product Manager on the Azure Machine Learning platform in the Cloud + AI group at Microsoft, focusing on Automated Machine Learning & hyperparameter tuning. Prior to this role, she worked on several products at Microsoft, including Bing and MSN. She is passionate about building products that can democratize & accelerate AI.

Davy Wang is the GM of Tencent Cloud International, responsible for Tencent Cloud’s overseas business.
He previously served as chief architect and solution director of Amazon AWS Greater China, where he was among the first group to earn AWS expert certification. He was also a senior technical expert at Ali Yun and a Senior Consulting Manager at IBM.

Ben Lorica is the Chief Data Scientist at O’Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.

Speakers:

Lei is the big data architect of Rakuten, Inc. He leads the data platform design and implementation to meet enterprise-level data analytics requirements. He is also building a team of data experts to deliver high-standard solutions. His interests broadly include cloud computing, data analytics, distributed systems, and various open source projects.

The Data Flywheel is a comprehensive and additive approach for business and technology leaders to enable organizations to get the most value from their data.
In this session, we will share common design patterns AWS customers are applying as part of their Data and AI journey. It will include real world examples.
Speaker:
Roy Ben-Alta is the Head of WW Data & Analytics/ML and Robotics Practice at Amazon Web Services. Roy is an accomplished technology executive with over 15 years of experience in big data and analytics, cloud computing, business intelligence, as well as with hands-on experience in architecture, design, implementation, and operations.

Check out the exhibitors and mingle!

Moderator:
Ben Lorica is the Chief Data Scientist at O’Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.
Panelists:
Max is the CEO and founder of Preset, the Apache Superset company. He has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Facebook, Airbnb, Lyft and Yahoo!. A leader in the open-source community, Max is the creator of Apache Superset, a popular open-source data visualization, exploration and analytics platform, and of Apache Airflow, an open-source tool for orchestrating complex computational workflows and data processing pipelines. More recently he founded Preset, a company devoted to building upon Superset to offer next-generation analytics as a service.
Tobi Knaup is the CTO and Co-Founder of Mesosphere, the hybrid cloud platform company which helps organizations adopt transformative technologies like container orchestration, machine learning and real-time analytics. He was one of the first engineers and tech lead at Airbnb. At Airbnb, he wrote large parts of the infrastructure including the search and fraud prediction services. He helped scale the site to millions of users and build a world class engineering team. Tobi is the main author of Marathon, the world’s first open source container orchestrator. He also co-created KUDO, an open source toolkit for building Kubernetes Operators.
Haoyuan (H.Y.) Li is the Founder and CTO of Alluxio. He graduated with a Computer Science Ph.D. from the AMPLab at UC Berkeley. At the AMPLab, he co-created and led Alluxio (formerly Tachyon), an open source virtual distributed file system. Before UC Berkeley, he earned an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

Alluxio founding engineer and core contributor Gene Pang will share the latest innovations for structured data in Alluxio open source.
Speakers:
Gene Pang is a PMC maintainer of the Alluxio open source project and a founding member of Alluxio, Inc. He graduated with a Ph.D. from the AMPLab at UC Berkeley, working on distributed database systems. Before starting at Berkeley, he worked at Google. He has an M.S. from Stanford University and a B.S. from Cornell University.

Speakers:

Eric (Peng) Li is a Senior Architect of Alibaba Cloud Container Service, focusing on the design and development of storage, security, and elastic scheduling for Kubernetes-based container products. He is a practitioner in large-scale Internet containerization, persistent data services, AI container platforms, and bio-hybrid cloud deployments. As a data scientist, he was a top winner of the US FDA 2018 Precision Medical Competition.

Apache Iceberg is a new table format for tracking very large-scale tables, designed for object stores like S3. This talk will cover why Netflix needed to build Iceberg and the project’s high-level design, and will highlight the details that unblock better query performance.

Speakers:

Ryan Blue works on open source data projects at Netflix. He is one of the original creators of Apache Iceberg and is a committer in the Apache Spark, Parquet, and Avro communities.

tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data, different data formats, and perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.

Speakers:

Jiri is a tech lead of the tf.data project and a software engineer at Google. He holds a PhD from Carnegie Mellon University and throughout his career he has worked on distributed systems and performance, most recently TensorFlow.

Join us for Happy Hour! Grab some bites and a drink from the open bar, mingle with the speakers and other attendees, and stay for the raffle announcement!

VENUE & DETAILS

Computer History Museum

1401 N Shoreline Blvd
Mountain View, CA 94043

Free onsite parking

SPONSORS