Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options to read, write, and manage files. This article describes the different ways to write files to Alluxio and the tradeoffs each makes in performance, consistency, and fault tolerance compared to HDFS.
Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write, which lets users complete writes to the Alluxio file system and return quickly with high application performance, while still providing peace of mind that the data will be persisted in the background to the chosen under storage, such as S3.
This meetup presents an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release, along with "Real-time Data Processing for Sales Attribution Analysis with Alluxio, Spark and Hive at VIPShop."
Today, real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and keep those jobs consistent during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to dramatically reduce the workload pressure on our HDFS NameNode.
The purpose of Alluxio is to be an abstraction layer with storage systems underneath it. Alluxio is designed on the assumption that there is a storage layer underneath it, so using it as another storage system does not solve the problem of having storage and compute co-located. Alluxio allows you to have long-running data … Continued
This article presents a different approach to making distributed file systems like HDFS or cloud storage systems look like a local file system to data processing frameworks: the Alluxio POSIX API. To explain the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.
Alluxio is a proud sponsor and exhibitor at the Presto Summit in San Francisco.
What’s Presto Summit? It’s the leading Presto conference co-organized by our partner Starburst Data and the Presto Software Foundation.
Some people experience serious performance issues with HDFS NameNode (v2.7) response time. Particularly during peak traffic, an HDFS NameNode can become overloaded, and some DFS operations (like listing a directory) can take a long time, which affects query response time for Presto and other Hadoop applications. To solve for challenges in high latency … Continued
What is Apache Hadoop? If you’re new to building big data applications, Apache Hadoop is a distributed framework for managing data processing and storage for big data applications running in clustered systems. It consists of 5 modules – a distributed file system (aka HDFS or Hadoop Distributed File System), MapReduce for parallel processing of datasets, … Continued