Alluxio Blog

Deploying Big Data Workloads on Object Storage Without Performance Penalty

As the amount of data that enterprises collect and analyze continues to grow, more attention is being paid to the cost of storing that data relative to performance. Hadoop provides a scalable and fast way of storing and analyzing data; however, storing data in Hadoop typically costs more than alternative technologies such as object stores.

Developer Tip: Why Did My Job Fail with Error Message “Class alluxio.hadoop.FileSystem not found”?

From time to time, a question pops up on the user mailing list referencing job failures with the error message “java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found”. This post explains the reason for the failure and the solution to the issue when it occurs.
This error indicates that the Alluxio client is not available at runtime. The job throws the exception when it tries to access the Alluxio filesystem but cannot find the Alluxio client implementation it needs to connect to the service.
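As a rough diagnostic for the situation described above, the following Python sketch checks whether any jar on a given classpath string actually contains the class named in the error message. The function name `jars_providing` is hypothetical, and the sketch assumes a plain list of jar paths; it ignores directories and wildcard entries.

```python
import os
import zipfile

# The class named in the error message, expressed as a path inside a jar.
CLASS_ENTRY = "alluxio/hadoop/FileSystem.class"


def jars_providing(classpath, entry=CLASS_ENTRY):
    """Return the jars on `classpath` that contain `entry` (hypothetical helper)."""
    found = []
    for item in classpath.split(os.pathsep):
        # This sketch only inspects existing .jar files; directories,
        # wildcard entries, and missing files are skipped.
        if not (item.endswith(".jar") and os.path.isfile(item)):
            continue
        try:
            with zipfile.ZipFile(item) as jar:
                if entry in jar.namelist():
                    found.append(item)
        except zipfile.BadZipFile:
            # Not a readable archive, so it cannot provide the class.
            continue
    return found
```

An empty result for the classpath the job really runs with would be consistent with the ClassNotFoundException above.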

New York Meetup Recap – September 2018

We held our first New York City Alluxio Meetup! Work-Bench generously hosted the meetup in Manhattan. This was the first US Alluxio meetup outside of the Bay Area, so it was extremely exciting to meet Alluxio enthusiasts on the East Coast!
The meetup focused on how users run Alluxio with different applications, such as Hive and Presto. As an introduction, Haoyuan Li (creator and founder of Alluxio) and Bin Fan (founding engineer of Alluxio) gave an overview of Alluxio and the features and enhancements in the new v1.8.0 release.

A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform (Part 1)

This blog explores the challenges customers are facing with storing data long term in Hadoop, and discusses what the Hitachi Content Platform team is doing to help our customers solve these challenges with the help of Alluxio.
Data is at the center of our digital world and for years Hadoop has been the go-to data processing platform because it is fast and scalable. While Hadoop has solved the data storage and processing problem for the last ~10 years, it achieves this by scaling storage and compute capacity in parallel. As a result, Hadoop environments have continued to expand compute capacity well beyond their needs as more and more of the storage is consumed by older, inactive data.

Effective caching for Spark RDDs with Alluxio

Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their case study, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or would never finish because they would run out of memory. After using Alluxio, those jobs were able to finish, because the data could be stored in Alluxio, instead of within Spark.
In this blog, we show that saving RDDs in Alluxio lets Spark applications keep larger data sets in memory and run faster, and also lets separate Spark applications share RDDs.

Announcing Alluxio v1.8.0

We are excited to announce the release of Alluxio Enterprise Edition (AEE), Alluxio Community Edition (ACE), and Alluxio Open Source (AOS) v1.8.0. This release brings features and enhancements that simplify cloud adoption, including hybrid cloud and migration from HDFS to object storage, for analytics and machine learning, and that improve usability.
To make it easier to get started with Alluxio, we have also collected a set of resources into a starter kit. The second item is a simple tutorial on how to mount a remote AWS S3 bucket and accelerate data access.

Data Location Awareness: Optimize Performance and Lower Cost with Tiered Locality

Caching frequently used data in memory is not a new computing technique, but Alluxio has taken the concept to the next level with the ability to aggregate data from multiple storage systems in a unified pool of memory. Alluxio's capabilities extend further to intelligently managing the data within that virtual data layer. Tiered locality uses awareness of network topology and configurable policies to manage data placement for performance and cost optimization. This feature is particularly useful in cloud deployments that span multiple availability zones. It can also save costs in environments where cross-zone or cross-location traffic is more expensive than intra-zone traffic.
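To make the idea of topology awareness concrete, here is a minimal Python sketch of locality-based selection. It is an illustration only, not Alluxio's implementation: the tier names (`node`, `rack`, `zone`), the function `pick_worker`, and the fallback rule are all assumptions made for this example.

```python
# Tiers ordered from most specific to least specific (hypothetical names).
TIERS = ("node", "rack", "zone")


def pick_worker(client, workers):
    """Return the name of the worker sharing the most specific tier with the client.

    `client` maps tier name -> value; `workers` maps worker name -> such a mapping.
    Falls back to the first worker (or None if there are none) when nothing matches.
    """
    for tier in TIERS:
        for name, identity in workers.items():
            # A match at an earlier tier always wins over a later tier.
            if tier in client and identity.get(tier) == client[tier]:
                return name
    return next(iter(workers), None)
```

For example, a client in rack `r1` of zone `us-east-1a` would be matched to a worker in the same rack before a worker that only shares the zone, which keeps traffic off the more expensive cross-zone path whenever a closer copy exists.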

Asynchronous Caching in Alluxio – High Performance for Partial Read Caching for Presto and Spark

An Alluxio cluster caches data from connected storage systems in memory to create a data layer that can be accessed concurrently by multiple application frameworks. This greatly improves performance for many analytics workloads. On-demand caching occurs when clients read blocks of data using a ‘CACHE’ read type from persistent storage systems connected to the Alluxio cluster.
Prior to Alluxio v1.7, on-demand caching was on the critical path of read operations: a full block had to be read before the data was available to the application. Workloads that read partial blocks, for example SQL workloads, were adversely affected on initial reads from connected storage.
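The difference between caching on and off the critical path can be sketched with a toy model in Python. This is not Alluxio code: `ToyWorker`, its dictionary-backed under store, and the `asynchronous` flag are hypothetical, and the sketch assumes that asynchronous caching means returning the requested bytes first and caching the full block in the background.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyWorker:
    """Toy model of a caching worker; all names here are hypothetical."""

    def __init__(self, under_store):
        self.under_store = under_store  # block id -> bytes in persistent storage
        self.cache = {}                 # block id -> bytes cached by the worker
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _cache_block(self, block_id):
        # Copy the full block from the under store into the cache.
        self.cache[block_id] = self.under_store[block_id]

    def read(self, block_id, offset, length, asynchronous=True):
        """Return (data, pending), where pending is a Future or None."""
        if block_id in self.cache:
            return self.cache[block_id][offset:offset + length], None
        if not asynchronous:
            # Synchronous model: the full block is cached before any data returns.
            self._cache_block(block_id)
            return self.cache[block_id][offset:offset + length], None
        # Asynchronous model: return only the requested range right away
        # and cache the full block in a background thread.
        data = self.under_store[block_id][offset:offset + length]
        pending = self._pool.submit(self._cache_block, block_id)
        return data, pending
```

In the asynchronous branch the caller gets its partial read without waiting for the whole block, and later reads of the same block are served from the cache once the background task finishes.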