Tech Talk Slide Deck

Alluxio at Strata + Hadoop World San Jose 2017


Effective Spark with Alluxio

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that uses memory to store data and to accelerate access to data held in different storage systems. Alluxio has a quickly growing open source community of developers and users and is deployed at organizations such as Alibaba, Baidu, Barclays, Intel, Huawei, and Qunar. Many of these deployments use Alluxio with Spark, and some of them scale out to petabytes of data.

While Spark is already gaining wide adoption, Alluxio can enable Spark to be even more effective. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. Calvin Jia introduces Alluxio, explains how Alluxio can help Spark be more effective, shows benchmark results with Spark RDDs and DataFrames, and describes production deployments in which Alluxio and Spark work together. Along the way, Calvin offers live demos to illustrate Alluxio’s use cases.


Alluxio: Unify Data at Memory Speed

In the past year, the Alluxio project experienced significant improvement in performance and scalability and was extended with key new features including tiered storage, transparent naming, and unified namespace. At the same time, the Alluxio ecosystem has expanded to include support for more under storage systems and computation frameworks. In particular, Alluxio now supports a wide range of under storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift. These integrations make it possible for Alluxio to be leveraged in many different environments.

Haoyuan Li and Calvin Jia explore Alluxio’s goal of making its product accessible to an even wider set of users through a focus on security, new language bindings, and further increased stability. Haoyuan and Gene also cover some new APIs Alluxio is working on to allow applications to access data more efficiently and manage data across different under storage systems.