Alluxio on EMR: Fast Storage Access and Sharing for Spark Jobs

This is a guest blog by Chengzhi Zhao with an original blog source. Traditionally, if you want to run a single Spark job on EMR, you might follow the steps: launching a cluster, running the job which reads data from storage layer like S3, performing transformations within RDD/Dataframe/Dataset, finally, sending the result back to S3. … Continued

Speeding Big Data Analytics on the Cloud with In-Memory Data Accelerator

This is a guest blog from a team of Intel engineers, originally published as a white paper https://software.intel.com/en-us/articles/speed-big-data-analytics-on-the-cloud-with-an-in-memory-data-accelerator. We cross post this article with the author’s permission. Introduction Discontinuity in big data infrastructure drives storage disaggregation, especially in companies experiencing dramatic data growth after pivoting to AI and analytics. This data growth challenge makes disaggregating … Continued

Distributed Data Querying with Alluxio

This is a guest blog by Jowanza Joseph with its original blog source. It is about how I used Alluxio to reduce p99 and p50 query latencies and optimized the overall platform costs for a distributed querying application. I walk through the product and architecture decisions that lead to our final architecture, discuss the tradeoffs, … Continued