mapreduce Archives

Accelerating Write-intensive Data Workloads on AWS S3

August 7, 2019 By Zac Blanco and Bin Fan

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write, which lets applications complete writes to the Alluxio file system and return quickly, while still giving users peace of mind that the data will be persisted in the background to the chosen under storage, such as S3.

Developer Tip: Why Did My Job Fail with Error Message “Class alluxio.hadoop.FileSystem not found”?

October 30, 2018 By Bin Fan

From time to time, a question pops up on the user mailing list referencing job failures with the error message “java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found”. This post explains the reason for the failure and the solution to the issue when it occurs.
This error indicates that the Alluxio client is not available at runtime. When the job tries to access the Alluxio filesystem, it cannot find the Alluxio client implementation it needs to connect to the service, and the exception is thrown.
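As a rough illustration of the failure described above, the following minimal sketch checks whether a class can be loaded from the current classpath. The class name `AlluxioClientCheck` and the helper `isOnClasspath` are hypothetical names chosen for this example; only `alluxio.hadoop.FileSystem` comes from the error message. It assumes the check runs in the same JVM and with the same class loader as the job, which may not hold for every framework.

```java
public class AlluxioClientCheck {

    /**
     * Returns true if the named class can be located by this class's loader.
     * Hypothetical helper for illustration; not part of the Alluxio API.
     */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, AlluxioClientCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or it was found but one of its dependencies could not be linked
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message in the post
        String cls = "alluxio.hadoop.FileSystem";
        if (isOnClasspath(cls)) {
            System.out.println(cls + " found on the classpath");
        } else {
            System.out.println(cls + " not found; the Alluxio client jar is likely missing from the classpath");
        }
    }
}
```

Running this with the same classpath as the failing job can help confirm whether the Alluxio client jar is visible to the JVM before digging into framework-specific configuration.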

TalkingData: Leading Data Broker in China Leverages Alluxio to Unify Terabytes of Data Across Disparate Data Sources

June 25, 2018 By Zhitao Yan (TalkingData)

TalkingData leverages Alluxio as a single platform to manage all the data across disparate data sources on premises and in the cloud. Alluxio removes the complexity of our environment by abstracting the different data sources and providing a unified interface. Applications simply interact with Alluxio, and Alluxio manages data access to different storage systems on behalf of the applications. Alluxio effectively democratizes data access, allowing data scientists and analysts in various business units to accomplish their goals without needing to consider where the data is located or having to go to central IT or the engineering team to transfer or prepare the data.

A Reliable Memory-Centric Distributed Storage System

October 16, 2015 By Haoyuan Li

A presentation by founder Haoyuan Li on Tachyon, a reliable memory-centric distributed storage system.

Tags: apache spark, big data, data, hadoop, mapreduce, performance, spark, storage

Tachyon: A Reliable Memory-Centric Distributed Storage System

July 15, 2015 By Bin Fan

We introduce Tachyon, a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.

Tags: apache spark, big data, data, hadoop, mapreduce, performance, spark, storage
