hdfs Archives | Page 2 of 10

Adopting Satellite Clusters with Alluxio at Vipshop to Improve Spark Jobs for Targeted Advertising by 30x

July 25, 2020 By Gang Deng (Vipshop) and Jasmine Wang

As the third-largest e-commerce site in China, Vipshop processes large amounts of data collected daily to generate targeted advertisements for its consumers. In this article, Gang Deng from Vipshop describes how to meet SLAs by improving struggling Spark jobs on HDFS by up to 30x, and optimize hot data access with Alluxio to create … Continued

How to Build a new Under Filesystem in Alluxio: Apache Ozone as an Example

July 7, 2020

In Alluxio, an Under File System is a plugin that connects Alluxio to a file system or object store, so users can mount different storage systems such as AWS S3 or HDFS into the Alluxio namespace. The Under File System framework is designed to be modular, so users can easily extend it with their own Under File System implementation and connect to a new or customized storage system.

Tags: apache ozone, aws s3, hdfs, meetup, object stores, storage, under filesystem
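
As a rough illustration of the modular design described in the entry above, the sketch below shows one way a namespace could pick a storage plugin by URI scheme and map mounted paths onto the backing store. It is not Alluxio's actual API: the `UnderFileSystem` and `Namespace` classes, their methods, and the example URIs are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict
from urllib.parse import urlparse


class UnderFileSystem:
    """Hypothetical minimal storage plugin; a real one would also read and write data."""

    def __init__(self, root_uri: str) -> None:
        self.root_uri = root_uri.rstrip("/")

    def resolve(self, relative_path: str) -> str:
        # Map a path relative to the mount point onto the backing store.
        relative = relative_path.strip("/")
        return f"{self.root_uri}/{relative}" if relative else self.root_uri


@dataclass
class Namespace:
    """Hypothetical unified namespace: scheme -> plugin factory, mount point -> plugin."""

    factories: Dict[str, Callable[[str], UnderFileSystem]] = field(default_factory=dict)
    mounts: Dict[str, UnderFileSystem] = field(default_factory=dict)

    def register(self, scheme: str, factory: Callable[[str], UnderFileSystem]) -> None:
        # Supporting a new storage system only requires registering a factory.
        self.factories[scheme] = factory

    def mount(self, mount_point: str, ufs_uri: str) -> None:
        scheme = urlparse(ufs_uri).scheme
        if scheme not in self.factories:
            raise ValueError(f"no under file system registered for scheme {scheme!r}")
        self.mounts[mount_point.rstrip("/")] = self.factories[scheme](ufs_uri)

    def resolve(self, path: str) -> str:
        # The longest matching mount point wins, so nested mounts behave sensibly.
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            if path == mount_point or path.startswith(mount_point + "/"):
                return self.mounts[mount_point].resolve(path[len(mount_point):])
        raise FileNotFoundError(path)


# Example usage with illustrative URIs.
namespace = Namespace()
namespace.register("s3", UnderFileSystem)
namespace.register("hdfs", UnderFileSystem)
namespace.mount("/mnt/s3", "s3://example-bucket/data")
namespace.mount("/mnt/hdfs", "hdfs://namenode:9000/warehouse")
print(namespace.resolve("/mnt/s3/events/part-0.parquet"))
```

In this sketch, connecting a new storage system such as Apache Ozone would amount to registering one more factory under its URI scheme; existing mounts are unaffected.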

Bursting Spark or Presto Jobs to AWS using Alluxio

June 23, 2020

In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the need to manually copy and synchronize data from the on-premises data lake to cloud storage. This approach allows compute frameworks to decouple from on-premises data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.

Tags: aws, cloud storage, compute, hdfs, hybrid cloud, office hour, performance, presto, spark, zero copy bursting

Accelerating Analytics by 200% with Impala, Alluxio, and HDFS at Tencent

June 22, 2020 By Honghan Tian (Tencent)

This article describes how engineers in the Data Service Center at Tencent PCG leverage Alluxio to improve analytics performance by 200% and minimize operating costs in building Tencent Beacon Growing, a real-time data analytics platform.

Deep Learning at Alibaba Cloud with Alluxio – Running PyTorch on HDFS

June 19, 2020 By Yang Che (Alibaba)

Google’s TensorFlow and Facebook’s PyTorch are two Deep Learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it due to its ease of use. By default, PyTorch does not support Deep Learning model training directly in HDFS, which brings challenges … Continued

How to Build a new Under Filesystem in Alluxio: Apache Ozone as an Example

Alluxio Global Online Meetup * June 30, 2020

Bursting Spark or Presto Jobs to AWS using Alluxio

Community Online Office Hour * June 23, 2020

Burst Presto & Spark workloads to AWS EMR with no data copies

April 28, 2020

In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.

Tags: analytic workloads, cloud, hdfs, hybrid cloud, office hour, presto, public cloud, spark

“Zero-Copy” Hybrid Cloud for Data Analytics – Strategy, Architecture and Benchmark Report

April 6, 2020

This whitepaper details how to leverage a public cloud, such as Amazon AWS, Google GCP, or Microsoft Azure, to scale analytic workloads directly on on-premises data without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Presto and Hive with Alluxio in the public cloud using on-prem HDFS. We will also show how to set up and execute performance benchmarks in two geographically dispersed Amazon EMR clusters, along with a summary of our findings.

Tags: aws, azure, data analytics, emr, gcp, hdfs, hive, hybrid cloud, presto, public cloud, zero copy
