benchmark Archives

China Unicom Uses Alluxio and Spark to Build New Computing Platform to Serve Mobile Users

April 10, 2019 By Zhang Ce

China Unicom is one of the five largest telecom operators in the world. Its booming 4G and 5G business must serve a rapidly growing base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the volume of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and the Greenplum database, it was still difficult to cover critical business scenarios with their varying, massive data processing requirements.

Achieving 10x acceleration of Spark and Hive Jobs on AWS S3 with Alluxio Tiered Storage

February 20, 2019

The data engineering team at Bazaarvoice, a software-as-a-service digital marketing company based in Austin, Texas, must handle data at massive Internet scale to serve its customers. Facing challenges with scaling up storage capacity and provisioning hardware, the team turned to Alluxio’s tiered storage system and saw 10x acceleration of its Spark and Hive jobs running on AWS S3.

In this whitepaper you’ll learn:

How to build a big data analytics platform on AWS that includes technologies like Hive, Spark, Kafka, Storm, Cassandra, and more
How to set up a Hive metastore using a storage tier for hot tables
How to leverage tiered storage to maximize read performance

Tags: apache hive, apache spark, aws s3, benchmark, case study, performance, tiered storage

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

February 20, 2019 By Thai Bui

In this article, Thai Bui from Bazaarvoice describes how Bazaarvoice leverages Alluxio to build a tiered storage architecture with AWS S3 that maximizes performance and minimizes the operating costs of running big data analytics on AWS EC2.

One Click to Benchmark Spark + Alluxio + S3 Stack with TPC-DS queries on AWS

February 12, 2019 By Rico Chiu

The Alluxio sandbox is the easiest way to test drive the popular data analytics stack of Spark, Alluxio, and S3 deployed in a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to see the performance benefits of running Spark jobs that interface through Alluxio on S3 compared to running Spark jobs directly on S3. It is extremely easy to request and launch a sandbox cluster as a playground for 24 hours at no cost to you.

Presto on Alluxio: How Netease Games leveraged Alluxio to boost ad hoc SQL on HDFS

January 11, 2019 By Shuang Li

Netease Games operates many popular online games in China, such as “World of Warcraft” and “Hearthstone”. Netease Games has also developed quite a few popular games of its own, such as “Fantasy Westward Journey 2”, “Westward Journey 2”, “World 3”, and “League of Immortals”. The strong growth of the business drives the demand to build and maintain a data platform that handles a massive amount of data and delivers insights from it promptly. Given our data scale, it is very challenging to support high-performance ad hoc queries that return results in a timely manner.

How To Speed Up Alluxio Metadata Operations Up To 100X

October 16, 2018 By David Zhu

This blog describes our experience in speeding up Alluxio metadata operations using fingerprints and bulk operations on the Alluxio under store. These latest optimizations can be found in the 1.8.1 release.

One of the major benefits Alluxio provides is a simple and unified interface to manage files and directories on different underlying storage systems. Alluxio acts as an intermediate layer and exposes a file interface for applications to interact with, even though the underlying storage system might be an object store that has a different interface.
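
The excerpt does not spell out how fingerprints are used, so the following is only an illustrative sketch of the general idea, not Alluxio's implementation. It assumes a fingerprint is a digest of a few metadata fields reported by the under store, and that one bulk listing supplies the status of many files at once. The names `UfsStatus`, `fingerprint`, and `MetadataCache` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UfsStatus:
    """Hypothetical snapshot of one file's metadata in the under store."""
    path: str
    length: int
    last_modified_ms: int
    content_hash: str  # for example, an object-store ETag (assumption)


def fingerprint(status: UfsStatus) -> str:
    """Condense the metadata fields into one comparable string."""
    raw = f"{status.length}|{status.last_modified_ms}|{status.content_hash}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class MetadataCache:
    """Keeps one fingerprint per path and reloads only paths that changed."""

    def __init__(self) -> None:
        self._fingerprints: Dict[str, str] = {}
        self.reload_count = 0

    def sync(self, listing: List[UfsStatus]) -> List[str]:
        """Process one bulk listing and return the paths that were reloaded."""
        changed = []
        for status in listing:
            fp = fingerprint(status)
            # An unchanged fingerprint means the cached metadata is still valid.
            if self._fingerprints.get(status.path) != fp:
                self._fingerprints[status.path] = fp
                self.reload_count += 1
                changed.append(status.path)
        return changed
```

In this sketch, a second `sync` over an unchanged listing reloads nothing, which is where the savings would come from when most files have not changed between checks.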

Intel: How to Use Alluxio to Accelerate Big Data Analytics on the Cloud and New Opportunities with Persistent Memory

October 1, 2018 By Yuan Zhou

Learn how Intel uses Alluxio to accelerate big data analytics in the cloud, and explore new opportunities for persistent memory when compute and storage are separated.

Tags: apache spark, aws s3, benchmark, compute storage separation, partner

Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com’s Computation Frameworks

September 14, 2018 By Bing Bai & Tao Huang [JD.com]

Strata NY 2018 – Learn how to use Alluxio as a pluggable optimization component. Understand how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing while ensuring consistency between Alluxio and HDFS.

Tags: apache hadoop, benchmark, case study, compute storage separation, hdfs, presto

Effective caching for Spark RDDs with Alluxio

August 24, 2018 By Gene Pang and Pei Sun

Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their case study, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or never finish because they ran out of memory. After adopting Alluxio, those jobs were able to finish, because the data could be stored in Alluxio instead of within Spark.

In this blog, we show that by saving RDDs in Alluxio, Spark applications can keep larger data sets in memory and run faster, and separate Spark applications can share the same RDDs.
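
As a rough illustration of saving and sharing an RDD through Alluxio, the sketch below uses PySpark's `saveAsTextFile` and `textFile` with an `alluxio://` URI. The master address `localhost:19998`, the path `/rdd/numbers`, and the helper names are assumptions made for this example, and it presumes the Alluxio client jar is already on Spark's classpath.

```python
# Assumed master address; 19998 is Alluxio's default master RPC port.
ALLUXIO_MASTER = "alluxio://localhost:19998"


def alluxio_uri(path: str, master: str = ALLUXIO_MASTER) -> str:
    """Build a full Alluxio URI from a path."""
    return master.rstrip("/") + "/" + path.lstrip("/")


def save_rdd(rdd, path: str) -> str:
    """Write an RDD to Alluxio as text files and return the URI used.

    Spark refuses to write if the target directory already exists.
    """
    uri = alluxio_uri(path)
    rdd.saveAsTextFile(uri)
    return uri


def load_rdd(sc, path: str):
    """Read a saved RDD back, possibly from a different Spark application."""
    return sc.textFile(alluxio_uri(path))


def demo() -> None:
    """Run against a live cluster; requires PySpark and a running Alluxio."""
    from pyspark import SparkContext

    sc = SparkContext(appName="alluxio-rdd-sketch")
    numbers = sc.parallelize(range(1000)).map(str)
    save_rdd(numbers, "/rdd/numbers")
    # A second application could call load_rdd with the same path.
    print(load_rdd(sc, "/rdd/numbers").count())
    sc.stop()
```

Calling `demo()` exercises the round trip on a real cluster. Because the data lives in Alluxio rather than in the Spark executors, it can outlive the application that wrote it.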

Tag: benchmark