Some compute frameworks like Apache Spark have single node-caching, how is Alluxio different than single-node caching?

While single node caching may be sufficient for some users, for many it does not improve the performance meaningfully. By definition, a single node cache is limited to what that single node has accessed. Also, most frameworks with a single node cache typically do not leverage the SSD or HDD in the node.

Alluxio is different in that it is the only data orchestration layer that provides data locality with a distributed cache, API compatibility with popular big data frameworks, and a global namespace to aggregate data from several data silos. Alluxio allows you to aggregate available cluster resources and present them to these frameworks as a distributed tiered caching layer. Alluxio can be configured to replicate hot data and intelligently tier the data across the RAM, SSD, HDD, and even a backing store like S3. The multi-tiering increases the amount of data that can be local to the worker nodes, reducing job completion times and providing consistent performance with object stores like AWS S3.

Tags: apache spark, caching

Alluxio and Compute Answers