Blog

Alluxio's strong Q2 featured Enterprise AI 3.7 launch with sub-millisecond latency (45× faster than S3 Standard), 50%+ customer growth including Salesforce and Geely, and MLPerf Storage v2.0 results showing 99%+ GPU utilization, positioning the company as a leader in maximizing AI infrastructure ROI.

In this blog, Greg Lindstrom, Vice President of ML Trading at Blackout Power Trading, an electricity trading firm in North American power markets, shares how they leverage Alluxio to power their offline feature store. This approach delivers multi-join query performance in the double-digit millisecond range, while maintaining the cost and durability benefits of Amazon S3 for persistent storage. As a result, they achieved a 22 to 37x reduction in large-join query latency for training and a 37 to 83x reduction in large-join query latency for inference.
.png)

China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements.
.jpeg)
In a recent blog, we discussed the ideation, design and new features in Alluxio 2.0 preview. Today we are thrilled to announce another new revolutionary project that the Alluxio engineering team has been hard at work on for the past year - the Alluxio Virtual Reality (VR) client.

Apache Spark has brought significant innovation to Big Data computing, but its results are even more extraordinary when paired with Alluxio. Alluxio, provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. Bazaarvoice uses the combination of Spark and Alluxio to provide a real time big data platform that has the ability to not only handle the intake of 1.5 billion page views during peak events like Black Friday, but also provide real time analytics against it (read more). At this scale, the gain in speed is an enabler for new workloads. We’ve established a clean and simple way to integrate Alluxio and Spark.
.jpeg)
Caching frequently used data in memory is not a new computing technique, however it is a concept that Alluxio has taken to the next level with the ability to aggregate data from multiple storage systems in a unified pool of memory.

We are thrilled and excited to announce the availability of Alluxio 2.0 Preview Release - the largest open source release with the most new features and improvements since the creation of the project. It is now available for download. While Alluxio already enabled data locality and data accessibility for many big data workloads in the cloud, there was still innovation needed in key areas.

Presto is an open source distributed SQL engine widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source distributed file system that provides a unified data access layer at in-memory speed. The combination of Presto and Alluxio is getting more popular in many companies like JD, NetEase to leverage Alluxio as distributed caching tier on top of slow or remote storage for the hot data to query, avoiding reading data repeatedly from the cloud. In general, Presto doesn't include a distributed caching tier and Alluxio enables caching of files and objects that the Presto query engine needs.
.jpeg)
In this article, Thai Bui from Bazaarvoice describes how Bazaarvoice leverages Alluxio to build a tiered storage architecture with AWS S3 to maximize performance and minimize operating costs on running Big Data analytics on AWS EC2.
.jpeg)
The Alluxio sandbox is the easiest way to test drive the popular data analytics stack of Spark, Alluxio, and S3 deployed in a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to see the performance benefits of running Spark jobs that interface through Alluxio on S3 compared to running Spark jobs directly on S3. It is extremely easy to request and launch a sandbox cluster as a playground for 24 hours at no cost to you.
.jpeg)
Impersonation is simply the ability for one user to act on behalf of another user. For example, say user ‘yarn’ has the credentials to connect to a service, but user ‘foo’ does not. Therefore, user ‘foo’ would never be able to access the service. However, user ‘yarn’ can access the service and impersonate (act on behalf of) user ‘foo’, allowing access to user ‘foo’. Therefore, impersonation enables one user to access a service on behalf of another user. The impersonation feature defines how users can act on behalf of other users. Therefore, it is important to know who the users are.
.jpeg)
Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio we take testing very seriously as organizations across the world rely on our technology, therefore, a problem we want to solve is how to test at scale without breaking the bank. In this blog we are going to show how the maintainers of the Alluxio open source project build and test our system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using Amazon AWS EC2, we are able to test 1000+ worker clusters, at a cost of about $16 per hour.