data locality Archives

A Journey Towards Data Locality on Cloud for Machine Learning and AI

December 18, 2023 By Lu Qiu and Shawn Sun

In this blog, we discuss the importance of data locality for efficient machine learning on the cloud. We examine the pros and cons of existing solutions and the tradeoff between reducing costs and maximizing performance through data locality. We then highlight the new-generation Alluxio design and implementation, detailing how it brings value to model training … Continued

Alluxio Use Cases Overview: Unify silos with Data Orchestration

October 19, 2021 By Hope Wang, Adit Madan and Bin Fan

This blog is the first in a series introducing Alluxio as the data platform to unify data silos across heterogeneous environments. The next blog will include insights from PrestoDB committer Beinan Wang to uncover the value for analytics use cases, specifically with PrestoDB as the compute engine.

Integrating Open Source Alluxio in AWS EKS with Terraform

Global Online Meetup * March 30, 2021

The presentation talks about the best practices to set up and techniques to build a cluster with open source Alluxio on AWS EKS, for one of our clients, which made it Scalable, Reliable, and Secure by adapting to Kubernetes RBAC.

Accelerating Data Computation on Ceph Objects using Alluxio

November 11, 2020

In this talk, we will present how using Alluxio computation and storage ecosystems can better interact benefiting the “bringing the data close to the code” approach. Moving away from the complete disaggregation of computation and storage, data locality can enhance the computation performance. During this talk, we will present our observations and testing results that will show important enhancements in accelerating Spark Data Analytics on Ceph Objects Storage using Alluxio.

Tags: ceph, compute, data locality, distributed storage, meetup, object storage, storage

Accelerating Data Computation on Ceph Objects using Alluxio

Alluxio Global Online Meetup * November 10, 2020

In this talk, we will present how using Alluxio computation and storage ecosystems can better interact benefiting of the “bringing the data close to the code” approach. Moving away from the complete disaggregation of computation and storage, data locality can enhance the computation performance.

Building a Cross-Region Hybrid Cloud Storage Gateway for Machine Learning & AI at WeRide

July 8, 2020 By Derek Tan (WeRide) and Jasmine Wang

In this blog, Derek Tan, Executive Director of Infra & Simulation at WeRide, describes how engineers leverage Alluxio as a hybrid cloud data gateway for applications on-premises to access public cloud storage like AWS S3.

Bursting Spark or Presto Jobs to AWS using Alluxio

Community Online Office Hour * June 23, 2020

In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.

Bursting Apache Spark Workloads to the Cloud on Remote Data

Community Online Office Hour * March 10, 2020

Accessing data to run analytic workloads in Spark across data centers and/or clouds can be challenging. Additionally, network I/O can bottleneck Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer and manually copy the input data to HDFS first, purging it afterward. But this ETL process can be both time-consuming and also error-prone.

Tag: data locality