Speeding Up Machine Learning in the Cloud with Alluxio
The latest Alluxio meetups, webinars, conferences and more
Alluxio has run in JD.com’s production environment on 100 nodes for six months. Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to support ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and treating Alluxio as a pluggable optimization component. For example, one framework, JDPresto, has seen a 10x performance improvement on average. This work has also extended Alluxio and improved the syncing between Alluxio and HDFS to keep them consistent.
A10 Big Data Application Summit
The future is an era of data, and an abstraction for efficiently managing, storing, and accessing data is a cornerstone of that era. Alluxio, an open source distributed virtual data system, is dedicated to providing simple and efficient data abstraction, convenient data sharing, and high-speed I/O for big data, machine learning, and artificial intelligence, while keeping applications and data persistent and offering a rich choice of storage systems.
Over several years of development, Alluxio has grown from a research prototype built by a few Ph.D. students and researchers in the AMPLab at the University of California, Berkeley, into a project with more than 800 code contributors (as of the Alluxio 1.8 release). It is deployed in production at Tencent, Baidu, JD, Two Sigma, Barclays Bank, and hundreds of other Chinese and international industry leaders, where it has become an important part of the data platform and data infrastructure.
This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Increasingly popular cloud object storage systems offer more cost-effective and scalable storage, but their semantics and performance characteristics differ from those of HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio to access cloud storage addresses these problems because data is retrieved and cached in Alluxio rather than being fetched repeatedly from the underlying cloud or object storage.
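The cross-job caching idea described above can be illustrated with a minimal read-through cache. This is only a conceptual sketch in plain Python and does not use any Alluxio API: the class `ReadThroughCache`, the function `fake_object_store`, and the path are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy caching layer placed in front of a slow backing store (hypothetical)."""

    def __init__(self, fetch_from_store: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_store      # function that reads from the backing store
        self._cache: Dict[str, bytes] = {}  # in-memory copies of data already read
        self.store_reads = 0                # how many times the backing store was hit

    def read(self, path: str) -> bytes:
        # Only go to the backing store on a cache miss.
        if path not in self._cache:
            self.store_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


def fake_object_store(path: str) -> bytes:
    # Stand-in for a remote object store read; a real read would be slow and billed.
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_object_store)

# Three "jobs" read the same object; only the first read reaches the store.
for _job in range(3):
    data = cache.read("bucket/table/part-0.parquet")

print(cache.store_reads)  # prints 1
```

A real deployment also has to handle eviction, consistency with the underlying store, and data placement across nodes, none of which this sketch attempts.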
The semantics and performance characteristics of cloud object storage systems differ from those of HDFS. Applications like Presto cannot benefit from node-level locality or cross-job caching when reading from the cloud. Deploying Alluxio with Presto to access cloud storage addresses these problems because data is retrieved and cached in Alluxio rather than being fetched repeatedly from the underlying cloud or object storage. Bin will present an architecture that combines Presto with Alluxio, along with use cases from major internet companies like JD.com and NetEase.com and the lessons they learned operating this architecture at scale.
We are excited to present Alluxio 2.0 to our community. The goals of Alluxio 2.0 were to significantly enhance data accessibility with improved APIs, to expand the supported use cases to include active workloads, and to improve metadata management and availability in support of hyperscale deployments. The Alluxio 2.0 Preview Release is the first major milestone on the path to Alluxio 2.0 and includes many new features.
Alluxio 2.0 is the most ambitious platform upgrade since the inception of Alluxio, with greatly expanded capabilities that empower users to run analytics and AI workloads on private, public, or hybrid cloud infrastructure, leveraging valuable data wherever it is stored. This preview release, now available for download, includes many advancements that will allow users to push the limits of their data workloads in the cloud.