Announcing the OEM partnership with Alluxio and Starburst Data, the company behind Presto, the fastest growing SQL query engine in a disaggregated world.
Traditionally, if you want to run a single Spark job on EMR, you might follow the steps: launching a cluster, running the job which reads data from storage layer like S3, performing transformations within RDD/Dataframe/Dataset, finally, sending the result back to S3. You end up having something like this.
If we add more Spark jobs across multiple clusters, you could have something like this.
This article walks through the journey of a startup HashData in Beijing to build a cloud-native high-performance MPP shared-everything architecture leveraging object storage as the data persistence layer and Alluxio as a data orchestration layer in the cloud.
we will illustrate how HDW leverages Alluxio as the data orchestration layer to eliminate the performance penalty introduced by object storage while benefiting from its scalability and cost-effectiveness.
Discontinuity in big data infrastructure drives storage disaggregation, especially in companies experiencing dramatic data growth after pivoting to AI and analytics. This data growth challenge makes disaggregating storage from compute attractive because the company can scale their storage capacity to match their data growth, independent of compute. This decoupled mode allows the separation of compute and storage, enabling users to rightsize hardware for each layer. Users can buy high-end CPU and memory configurations for the compute nodes, and storage nodes can be optimized for capacity.
This whitepaper is a continuation of Unlock Big Data Analytics Efficiency with Compute and Storage Disaggregation on Intel® Platforms
This is a guest blog by Jowanza Joseph with an original blog source. It is about how he used Alluxio to reduce p99 and p50 query latencies and optimized the overall platform costs for a distributed querying application. Jowanza walks through the product and architecture decisions that lead to our final architecture, discuss the tradeoffs, share some statistics on the improvements, and discuss future improvements to the system.
We are writing several engineering blogs describing the design and implementation of Alluxio master to address this scalability challenge. This is the first article focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
Alluxio serves its metadata from a single active master as the primary and potentially multiple standby master for high availability. The master handles all metadata requests and uses a write-ahead log to journal all changes so that we can recover from crashes. The log is typically written to shared storage like HDFS for persistence and availability. Standby masters read the write-ahead log to keep their own state up-to-date. If the primary master dies, one of the standbys can quickly take over for it.
At Alluxio, we believe that in order to fundamentally solve the data access challenges, the world needs a new layer – a data orchestration platform – between computation frameworks and storage systems.
Notice anything new about our websites? That’s right – we are super excited to launch our new website – Alluxio.io!
As we continue our focus on our open source community, one important item on our mind was to rebuild our website to provide better user experience for our community. To that end, you’ll see lots of changes in the Alluxio web experience.
Alluxio is a proud sponsor and exhibitor of Spark+AI Summit in San Francisco.
What’s Spark+AI Summit? It’s the world’s largest conference that is focused on Apache Spark – Alluxio’s older cousin open source project from the same lab (UC Berkeley’s AMPLab – now RISElab).