Resource Hub

On Demand Videos

Tech Talk: Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud

Blog

Enabling Data Location Awareness for Optimized Performance and Lower Cost With Alluxio Tiered Locality

Caching frequently used data in memory is not a new computing technique, but Alluxio has taken the concept to the next level with the ability to aggregate data from multiple storage systems into a unified pool of memory.

Blog

Announcing Alluxio 2.0 Preview enabling hyperscale data workloads in the cloud

We are thrilled to announce the availability of the Alluxio 2.0 Preview Release, the largest open source release with the most new features and improvements since the creation of the project. It is now available for download. While Alluxio already enabled data locality and data accessibility for many big data workloads in the cloud, there was still innovation needed in key areas.

Blog

Top 5 Performance Tuning Tips for Presto caching using Alluxio

Presto is an open source distributed SQL engine widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source distributed file system that provides a unified data access layer at in-memory speed. The combination of Presto and Alluxio is growing in popularity at companies such as JD and NetEase, which use Alluxio as a distributed caching tier on top of slow or remote storage so that hot data is not read repeatedly from the cloud. Presto itself does not include a distributed caching tier; Alluxio enables caching of the files and objects that the Presto query engine needs.

On Demand Videos

Tech Talk: Accelerate and Scale Big Data Analytics and Machine Learning Pipelines with Disaggregated Compute and Storage

White Paper

Achieving 10x acceleration of Spark and Hive Jobs on AWS S3 with Alluxio Tiered Storage

Blog

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

In this article, Thai Bui from Bazaarvoice describes how Bazaarvoice leverages Alluxio to build a tiered storage architecture with AWS S3 that maximizes performance and minimizes the operating costs of running big data analytics on AWS EC2.

On Demand Videos

Tech Talk: Achieving Separation of Compute and Storage in a Cloud World

Blog

One Click to Benchmark Spark Alluxio S3 Stack with TPC-DS queries on AWS

The Alluxio sandbox is the easiest way to test drive the popular data analytics stack of Spark, Alluxio, and S3 deployed in a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to see the performance benefits of running Spark jobs that interface through Alluxio on S3 compared to running Spark jobs directly on S3. It is extremely easy to request and launch a sandbox cluster as a playground for 24 hours at no cost to you.

White Paper

Effective caching of Spark Resilient Distributed Datasets (RDDs) with Alluxio

Presentation

Unified Big Data Analytics: Any Stack, Any Cloud

The big data stack has evolved considerably over the past few years, with an explosion of data frameworks starting with MapReduce and expanding to Apache Spark, Presto, and Hive on the structured data side, as well as TensorFlow and Caffe on the AI and ML side. The approach to managing and storing data has evolved as well, starting from HDFS and now moving to newer approaches like object stores. With all the possible combinations of accessing data, data engineering has become increasingly complex, particularly in hybrid and multi-cloud environments. Users are increasingly adding a new layer to their data stack that unifies files and objects and provides data locality across separated compute and storage environments.

This is the fundamental problem Alluxio solves. Alluxio is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. Alluxio enables distributed compute engines like Spark and Presto, or machine learning frameworks like TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, and Azure) while actively leveraging in-memory cache to accelerate data access. Originally developed at UC Berkeley AMPLab as the research project “Tachyon”, Alluxio has more than 900 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1000 nodes.

This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Cloud object storage systems provide more cost-effective and scalable storage, but they also come with different semantics and performance implications compared to HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio in front of cloud storage addresses these problems because data is retrieved and cached in Alluxio instead of being fetched repeatedly from the underlying cloud or object storage.
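The cross-job caching idea can be illustrated with a toy read-through cache. This is only a sketch under stated assumptions: `ObjectStore`, `ReadThroughCache`, and the `s3://bucket/...` key are hypothetical names invented for illustration, and nothing here reflects Alluxio's actual API or implementation.

```python
class ObjectStore:
    """Hypothetical stand-in for a remote object store; counts every read."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1  # each call models one remote fetch
        return self._objects[key]


class ReadThroughCache:
    """Toy cache placed between compute jobs and the store (illustrative only)."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, key):
        # Fetch from the store only on a miss; later reads are served locally.
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]


store = ObjectStore({"s3://bucket/table/part-0": b"rows"})
cache = ReadThroughCache(store)

# Two separate "jobs" read the same object through the shared cache.
first_job_data = cache.get("s3://bucket/table/part-0")
second_job_data = cache.get("s3://bucket/table/part-0")
remote_reads = store.reads
```

In this model the second job is served from the cache, so the store is contacted a single time even though the object is read twice.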

Blog

Alluxio Developer Tip: Why am I seeing the error "User yarn is not configured for any impersonation. impersonationUser: foo"?

Impersonation is the ability for one user to act on behalf of another user. For example, say user ‘yarn’ has the credentials to connect to a service, but user ‘foo’ does not. On its own, user ‘foo’ would never be able to access the service. However, user ‘yarn’ can access the service and impersonate (act on behalf of) user ‘foo’, giving user ‘foo’ access. The impersonation feature defines which users can act on behalf of which other users, so it is important to know who the users are.
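The check described above can be modeled in a few lines. This is a minimal sketch, assuming a simple rule table that maps each connecting user to the users it may act for; `can_impersonate` and the rule dictionaries are hypothetical and do not represent Alluxio's configuration format or code.

```python
def can_impersonate(connecting_user, impersonated_user, rules):
    """Return True if `connecting_user` may act on behalf of `impersonated_user`.

    `rules` is a hypothetical mapping: connecting user -> set of users
    it is allowed to impersonate.
    """
    allowed = rules.get(connecting_user)
    if allowed is None:
        # No rule exists for the connecting user at all.
        return False
    return impersonated_user in allowed


# With a rule in place, 'yarn' may act on behalf of 'foo'.
configured_rules = {"yarn": {"foo"}}
allowed_with_rule = can_impersonate("yarn", "foo", configured_rules)

# With no rule for 'yarn', the same request is rejected.
empty_rules = {}
allowed_without_rule = can_impersonate("yarn", "foo", empty_rules)
```

The second case, where no rule exists for the connecting user, is the kind of situation the error in the title appears to describe.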

White Paper

Testing Distributed Systems in the Big Data Ecosystem at 1000+ node Scale

Blog

Testing Distributed Systems at 1000-node Scale for the Cost of a Large Pizza (and yes, on AWS)

Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio we take testing very seriously because organizations across the world rely on our technology, so one problem we want to solve is how to test at scale without breaking the bank. In this blog we show how the maintainers of the Alluxio open source project build and test the system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using AWS EC2, we are able to test clusters of 1000+ workers at a cost of about $16 per hour.

Blog

Presto on Alluxio: How NetEase Games leveraged Alluxio to boost ad hoc SQL on HDFS

NetEase Games operates many popular online games in China, such as "World of Warcraft" and "Hearthstone". NetEase Games has also developed quite a few popular games of its own, such as "Fantasy Westward Journey 2", "Westward Journey 2", "World 3", and "League of Immortals". The strong growth of the business drives the demand to build and maintain a data platform that handles a massive amount of data and delivers insights promptly. Given our data scale, it is very challenging to support high-performance ad hoc queries that return results in a timely manner.

Blog

Top 10 Tips for Making the Spark Alluxio Stack Blazing Fast

The Apache Spark + Alluxio stack is getting quite popular, particularly for unifying data access across S3 and HDFS. In addition, compute and storage are increasingly being separated, causing larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But as with any technology stack, you need to follow best practices to get the best performance. This article provides the top 10 tips for performance tuning for real-world workloads when running Spark on Alluxio, with data locality giving the most bang for the buck.

Presentation

Alluxio – Virtual Unified File System

Presentation

Alluxio+Presto: An Architecture for Fast SQL in the Cloud

ALLUXIO BAY AREA MEETUP 2018

Alluxio is an open-source distributed file system that provides data ecosystems with a unified data access layer at in-memory speed. Alluxio enables compute engines like Spark, Presto, MapReduce, and TensorFlow to transparently access different persistent storage systems (including HDFS and S3) while actively leveraging in-memory cache to accelerate data access. As a result, Alluxio simplifies the development and management of big data and ML workloads with lower cost and better performance. Alluxio has more than 900 contributors and is used by over 100 companies worldwide. Andrew will give an overview of Alluxio’s core concepts, architecture, data flow, and production use cases.

Blog

Deploying Big Data Workloads on Object Storage Without Performance Penalty

As the amount of data collected and analyzed by enterprises continues to grow unabated, more attention is being paid to managing the cost of storing data relative to performance. Hadoop provides a scalable and fast way of storing and analyzing data; however, the cost of storing data in Hadoop is typically higher than with alternative technologies like object stores.

Blog

Developer Tip: Why Did My Job Fail with Error Message "Class alluxio.hadoop.FileSystem not found"?

From time to time, a question pops up on the user mailing list referencing job failures with the error message "java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found". This post explains the reason for the failure and the solution to the issue when it occurs. This error indicates that the Alluxio client is not available at runtime. The job throws an exception when it tries to access the Alluxio filesystem but cannot find the Alluxio client implementation needed to connect to the service.
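One quick way to investigate this kind of failure is to check whether any jar on the job's classpath looks like an Alluxio client. The sketch below is illustrative only: the helper `find_candidate_client_jars`, the assumption that the client jar's file name contains "alluxio", and the example paths are all hypothetical, and the post itself remains the authority on the actual fix.

```python
import os


def find_candidate_client_jars(classpath, marker="alluxio"):
    """Return classpath entries whose jar file name contains `marker`.

    Assumes the client jar's file name includes the marker string; this
    naming convention is an assumption, not something stated in the post.
    """
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [
        entry
        for entry in entries
        if entry.lower().endswith(".jar")
        and marker in os.path.basename(entry).lower()
    ]


# Hypothetical classpaths used purely for illustration.
with_client = os.pathsep.join(
    ["/opt/spark/jars/spark-core.jar", "/opt/alluxio/client/alluxio-client.jar"]
)
without_client = os.pathsep.join(["/opt/spark/jars/spark-core.jar"])

found = find_candidate_client_jars(with_client)
missing = find_candidate_client_jars(without_client)
```

An empty result, as in the second case, would be consistent with the client not being available at runtime.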

White Paper

Alluxio Overview: Open Source Data Orchestration Technology

Learn how Alluxio’s unified namespace for data distributed across private data centers and clouds improves performance and lowers costs.

White Paper

Alluxio Architecture and Data Flow

Blog

How To Speed Up Alluxio Metadata Operations Up To 100X

This blog describes our experience in speeding up Alluxio metadata operations using fingerprints and Alluxio under-store bulk operations. These latest optimizations can be found in the 1.8.1 release. One of the major values Alluxio provides is a simple and unified interface to manage files and directories on different underlying storage systems. Alluxio acts as an intermediate layer and exposes a file interface for applications to interact with, even though the underlying storage system might be an object store that has a different interface.

Presentation

Intel: How to Use Alluxio to Accelerate Big Data Analytics on the Cloud and New Opportunities with Persistent Memory

To further optimize Spark on disaggregated cloud storage and to benefit from rapid provisioning, excellent scalability, easy management, and pay-as-you-grow flexibility, we added an “In-Memory Data Acceleration” layer that supports big data filesystem operations natively and makes better use of memory to improve performance.

We tested deploying Alluxio with five 200 GB Memory. All Alluxio tests are based on the disaggregated S3A Ceph cloud storage configuration, enabling us to see the exact performance improvement after adding the in-memory data acceleration.

The results showed that both configurations provide a significant performance improvement.

For batch queries, performance with Alluxio shows more than a 1.42 times improvement compared with disaggregated S3A Ceph cloud storage, and similar performance to a traditional on-premise configuration. For the I/O-intensive Terasort workload, performance with Alluxio shows more than a 3.5 times improvement. Compared with the traditional on-premise configuration, disaggregated S3A Ceph cloud storage with Alluxio shows a 1.4 times performance improvement in the Terasort test. For the CPU-intensive K-Means workload, performance with Alluxio shows a 1.4 times improvement, while disaggregated S3A Ceph cloud storage with Alluxio is still 10% worse than the traditional on-premise configuration.

From the above data, we can conclude that using Alluxio as the cache can eliminate the performance overhead of S3A, and there is still a benefit to deploying big data on cloud storage. When the workload is I/O intensive, adopting Alluxio as the cache is even more beneficial.
