To further optimize Spark on disaggregated cloud storage while keeping the benefits of rapid provisioning, excellent scalability, easy management, and pay-as-you-grow flexibility, we added an “In-Memory Data Acceleration” layer that supports big data filesystem operations natively and makes better use of memory to improve performance.
We tested deploying Alluxio with five 200 GB memory. All Alluxio tests are based on the disaggregated S3A Ceph cloud storage configuration, which lets us see the exact performance improvement from adding the in-memory data acceleration.
The results showed that both configurations provide a significant performance improvement.
For batch queries, performance with Alluxio is more than 1.42 times better than disaggregated S3A Ceph cloud storage alone and similar to a traditional on-premise configuration. For the I/O-intensive Terasort workload, performance with Alluxio improves by more than 3.5 times; compared with the traditional on-premise configuration, disaggregated S3A Ceph cloud storage with Alluxio shows a 1.4 times performance improvement. For the CPU-intensive K-Means workload, performance with Alluxio shows a 1.4 times improvement, while disaggregated S3A Ceph cloud storage with Alluxio is still 10% worse than the traditional on-premise configuration.
So, from the above data, we can conclude that using Alluxio as the cache can eliminate the performance overhead of S3A and there is still a benefit when deploying big data on cloud storage. When the workload is I/O intensive, it is even more beneficial to adopt Alluxio as the cache.
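The quoted speedup factors can be combined to estimate how plain S3A Ceph storage compares with the on-premise setup, a figure the summary does not state directly. The sketch below is only illustrative arithmetic: it assumes the 3.5 times and 1.42 times figures are measured against S3A Ceph without Alluxio, that all factors are run-time ratios on the same workload, and that "similar performance" for batch queries can be treated as a factor of 1.0. The helper name `implied_speedup` is hypothetical.

```python
def implied_speedup(a_vs_c: float, a_vs_b: float) -> float:
    """Given the speedup of A over C and of A over B, return the speedup of B over C.

    Here A = S3A Ceph with Alluxio, B = on-premise, C = plain S3A Ceph.
    """
    if a_vs_c <= 0 or a_vs_b <= 0:
        raise ValueError("speedup factors must be positive")
    return a_vs_c / a_vs_b


# Terasort: 3.5x over plain S3A (assumed baseline), 1.4x over on-premise.
terasort_onprem_vs_s3a = implied_speedup(3.5, 1.4)

# Batch queries: 1.42x over plain S3A, "similar" to on-premise (assumed 1.0x).
batch_onprem_vs_s3a = implied_speedup(1.42, 1.0)

print(f"Terasort: on-premise vs. plain S3A is roughly {terasort_onprem_vs_s3a:.2f}x")
print(f"Batch:    on-premise vs. plain S3A is roughly {batch_onprem_vs_s3a:.2f}x")
```

Because the source reports "more than" 3.5 times and 1.42 times, the derived values are rough estimates under these assumptions, not measured results.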
Presentations
Use Alluxio to Unify Storage Systems in Suning
Suning is one of the leading commercial enterprises in China with two public companies in China and Japan respectively. It uses Alluxio to unify storage systems and manage multiple HDFS clusters.
STRATA DATA CONFERENCE LONDON 2018
JD.com is China’s largest online retailer and its biggest overall retailer, as well as the country’s biggest internet company by revenue. Currently, JD.com’s BDP platform runs more than 400,000 jobs (15+ PB) daily, on a system with more than 15,000 cluster nodes and a total capacity of 210 PB.
Alluxio, formerly Tachyon, is the world’s first system that unifies disparate storage systems at memory speed. In the big data ecosystem, Alluxio lies between computation frameworks or jobs and various kinds of storage systems. Additionally, Alluxio’s memory-centric architecture enables data access orders of magnitude faster than existing solutions.
Alluxio has run in JD.com’s production environment on 100 nodes for six months. Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to support ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. This work has also extended Alluxio and enhanced the syncing between Alluxio and HDFS for consistency.
Alluxio in MOMO: Accelerating Ad Hoc Analysis
From our friends at MOMO
MOMO, a leading pan-entertainment social platform in China, has deployed Alluxio to accelerate ad hoc query analytics. While evaluating the best fit for Alluxio in their infrastructure, they ran several performance tests to understand how ad hoc query analytics behaved in different scenarios. These tests give real-world insight into the performance benefits Alluxio provides. The MOMO findings include:
- With Alluxio, performance was improved 3-5x over the current mode
- Even when initially reading ‘cold’ data Alluxio delivered superior performance in most cases
- Alluxio can effectively scale out to improve performance as requirements grow