Using intermediate APIs means developers can learn just one framework and still access features offered by different technologies. It means writing job logic only once and being able to test it on a new underlying service with little extra effort. Modularity is a win for users, and it also lets creators of execution frameworks and storage systems focus on performance and capability without having to worry about API maintenance.
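As a rough illustration of "writing job logic only once", here is a minimal sketch. The `Storage` interface, both backends, and `word_count` are hypothetical names invented for this example and do not correspond to any real framework's API; the backends are in-memory stand-ins for two different underlying services.

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical intermediate API: the only interface job logic sees."""

    @abstractmethod
    def read(self, path):
        ...

    @abstractmethod
    def write(self, path, data):
        ...


class InMemoryStorage(Storage):
    """Stand-in for one underlying service: a plain path-to-content map."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = data


class PrefixedStorage(Storage):
    """Stand-in for a second service that keys objects under a bucket name."""

    def __init__(self, bucket):
        self._bucket = bucket
        self._objects = {}

    def _key(self, path):
        return f"{self._bucket}/{path.lstrip('/')}"

    def read(self, path):
        return self._objects[self._key(path)]

    def write(self, path, data):
        self._objects[self._key(path)] = data


def word_count(storage, src, dst):
    """Job logic written once against Storage, independent of the backend."""
    counts = {}
    for word in storage.read(src).split():
        counts[word] = counts.get(word, 0) + 1
    lines = [f"{word} {n}" for word, n in sorted(counts.items())]
    storage.write(dst, "\n".join(lines))
    return counts
```

Swapping `InMemoryStorage()` for `PrefixedStorage("demo-bucket")` requires no change to `word_count`, which is the property the paragraph above describes.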
Tag: compute storage separation
Enterprises typically store large amounts of data in existing storage systems, which are often separate from big data analytics systems. Therefore, importing petabytes of data into a big data analytics system takes a long time with large overheads and high costs. Even worse, transferring large amounts of data results in data silos and unnecessary duplication, which creates serious data management problems.
Alluxio is the first memory-speed virtual distributed storage system in the world. It unifies the interface between the various computing frameworks and under storage systems. Data access can be several orders of magnitude faster because of Alluxio’s memory-centric architecture. In addition, Alluxio’s tiered storage, unified namespace, flexible file API, web UI, and command-line tools increase its usability in different application scenarios.
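To make the idea of a unified namespace concrete, here is a toy sketch of a mount table that maps paths in one logical namespace onto different under storage systems. This is not Alluxio's implementation or API: the `MountTable` class, its methods, and the example URIs are all assumptions made for illustration.

```python
class MountTable:
    """Toy unified namespace: maps logical paths to under-storage URIs."""

    def __init__(self):
        self._mounts = {}

    def mount(self, namespace_path, ufs_uri):
        # Normalize so "/" stays "/" and "/mnt/s3/" becomes "/mnt/s3".
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Return the under-storage URI for a logical path.

        Uses longest-prefix matching, so a nested mount point takes
        precedence over the root mount.
        """
        best = None
        for prefix in self._mounts:
            matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        rest = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{rest}" if rest else base


# Hypothetical example URIs; any scheme would work the same way.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/mnt/s3", "s3://example-bucket/logs")
```

An application only ever sees logical paths such as `/mnt/s3/2016/a.log`; which storage system backs a given path is decided by the table.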
Qunar has been running Alluxio in production for over a year. Lei Xu explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.
The goal is to make Alluxio accessible to an even wider set of users through a focus on security, new language bindings, and further increased stability. In addition, the team is working on new APIs to allow applications to access data more efficiently and manage data across different under storage systems.
Using Alluxio to Improve Spark & Hadoop HDFS System Performance and Reliability [Chinese]
In this talk, we briefly introduce Alluxio, present several ways Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments of Alluxio and Spark working together. Along the way, we will provide live demos for some of the use cases.
Haoyuan Li explores Alluxio’s goal of making its product accessible to an even wider set of users, through a focus on security, new language bindings, and further increased stability. Haoyuan also covers some new APIs Alluxio is working on to allow applications to access data more efficiently and manage data across different under storage systems.
In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data through Alluxio memory for improved performance, and how Alluxio can improve completion times and reduce performance variability for Spark pipelines in the cloud.
This tutorial gives an overview of Alluxio basics, demonstrating how Alluxio works and how to use it to enable distributed computation engines (like Spark or MapReduce) to share data at memory speed. Using hands-on exercises, Yupeng and Rong walk you through deploying and running Alluxio, mounting external storage systems (like S3) into Alluxio’s namespace, interacting with Alluxio through built-in commands and the web UI, and building simple big data applications using common computation frameworks (e.g., Apache Spark and Hadoop MapReduce) to read from and write to Alluxio.