This article aims to provide a different approach to help connect and make distributed files systems like HDFS or cloud storage systems look like a local file system to data processing frameworks: the Alluxio POSIX API. To explain the approach better, we used the TensorFlow + Alluxio + AWS S3 stack as an example.
Tag: machine learning
This talk was presented by Alluxio’s top contributor and PMC Maintainer Calvin Jia at the Alluxio bay area Meetup.
This talk shares our design, implementation and optimization of Alluxio metadata service to address the scalability challenges, focusing on how to apply and combine techniques including tiered metadata storage (based on off-heap KV store RocksDB), fine-grained file system inode tree locking scheme, embedded state-replicate machine (based on RAFT), exploration and performance tuning in the correct RPC frameworks (thrift vs gRPC) and etc.
Problem It becomes increasingly more popular among data scientists to train models based on frameworks like TensorFlow on a local server or cluster while using remote shared storages like S3 or Google Cloud Storage to store a massive amount of the input data. This stack provides high flexibility and cost efficiency, especially requires no dev-ops … Continued
A new generation of open source big data, represented by Alluxio, born at the University of California at Berkeley, looks at this issue. Different from systems such as designing storage tight coupling to achieve low-cost reliable storage HDFS, by providing a virtual data storage layer defined and implemented by software for data applications, abstracting and integrating cloudy, hybrid cloud, multi-data center and other environments The underlying files and objects, and through intelligent workload analysis and data management, make data close to computing and provide data locality, big data and machine learning applications can be achieved with the same performance and lower cost.
In this talk, we will focus on Alluxio design, its architecture, data flow and metadata flow. We will dive into the choices in its design space and share the experiences when implementing features like data tiering, storage options and cache eviction policies. We will also share our lessons in design, implementation and operation when working to build an open source distributed storage systems with 900 contributors for 5+ years.
Enterprises are increasingly looking towards object stores to power their big data & machine learning workloads in a cost-effective way. The combination of SwiftStack and Alluxio together, enables users to seamlessly move towards a disaggregated architecture. Swiftstack provides a massively parallel cloud object storage and multi-cloud data management system. Alluxio is a data orchestration layer, which sits between compute frameworks and storage systems and enables big data workloads to be deployed directly on SwiftStack. Alluxio provides data locality, accessibility and elasticity via its core innovations. With the Alluxio and Swiftstack solution, Spark, Presto, Tensorflow and Hive and other compute workloads can benefit from 10X performance improvement and dramatically lower costs.
Alluxio can help data scientists and data engineers interact with different storage systems in a hybrid cloud environment. Using Alluxio as a data access layer for Big Data and Machine Learning applications, data processing pipelines can improve efficiency without explicit data ETL steps and the resulting data duplication across storage systems.
What’s Spark+AI Summit? It’s the world’s largest conference that is focused on Apache Spark – Alluxio’s older cousin open source project from the same lab (UC Berkeley’s AMPLab – now RISElab).
A few months ago, Baidu deployed Alluxio to accelerate its big data analytics workload. Bin Fan and Haojun Wang explain why Baidu chose Alluxio, as well as the details of how they achieved a 30x speedup with Alluxio in their production environment with hundreds of machines. Based on the success of the big data analytics engine, Baidu is currently expanding the Alluxio and Spark infrastructure to accelerate other applications, such as machine learning.