To further optimize Spark on disaggregated cloud storage and to benefit from rapid provisioning, excellent scalability, easy management, and pay as you grow flexibility, we added an “In-Memory Data Acceleration” layer to support big data filesystem operation natively and better utilize memory to improve the performance.
We tested deploying Alluxio with five 200 GB Memory. All Alluxio tests are based on the disaggregated S3A Ceph cloud storage configuration, enabling us to see the exact performance improvement after adding the in-memory data acceleration.
The results showed that ;both configurations provide a significant performance improvement.
For batch queries, performance with Alluxio shows more than 1.42 times improvement compared with disaggregated S3A Ceph cloud storage and similar performance to a traditional on-premise configuration. For the I/O intensive workload on Terasort, performance with Alluxio shows more than a 3.5 times improvement. And when compared with traditional on-premise configuration, disaggregated S3A Ceph cloud storage with Alluxio shows a 1.4 times performance improvement in the Terasort test. For CPU intensive workload using K-Means, performance with Alluxio shows 1.4 times improvement while compared to traditional on-premise configuration and performance with Alluxio disaggregate S3A Ceph cloud storage still indicates 10% worse than traditional on-premise configuration.
So, from the above data, we can conclude that using Alluxio as the cache can eliminate the performance overhead of S3A and there is still a benefit when deploying big data on cloud storage. When the workload is I/O intensive, it is even more beneficial to adopt Alluxio as the cache.