apache hadoop Archives

Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark

July 23, 2019

This meetup presents an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release, followed by a talk on real-time data processing for sales attribution analysis with Alluxio, Spark, and Hive at VIPShop.

Tags: alluxio engineering, apache hadoop, apache spark, compute, compute storage separation, data, data orchestration, hadoop, hdfs, meetup, scale, spark, storage

How do you offload workloads from Hadoop?

What is Apache Hadoop? If you’re new to building big data applications, Apache Hadoop is a distributed framework for managing data processing and storage for big data applications running in clustered systems. It consists of 5 modules – a distributed file system (aka HDFS or Hadoop Distributed File System), MapReduce for parallel processing of datasets, … Continued

Alluxio: Unifying APIs, Accelerating ML, & Enabling Cloud Architectures

Bay Area Meetup * September 14, 2016

Using intermediate APIs means developers can learn just one framework and still access features offered by different technologies. It means writing job logic only once and being able to test it on a new underlying service without rewriting it. Modularity is a win for users, and it also lets creators of execution frameworks and storage systems focus on performance and capability without having to worry about API maintenance.

Using Alluxio to Improve Spark & Hadoop HDFS System Performance and Reliability [Chinese]

Hadoop Summit China 2017 * March 15, 2017

How to Use Alluxio to Improve Spark and Hadoop HDFS Performance of Data Access and System Reliability [Chinese]

Database Technology Conference China 2017 * May 9, 2017

China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements. The complicated architecture of its incumbent computing platform also made it hard to use resources effectively.

Alluxio Developer Tip: Why am I seeing the error “User yarn is not configured for any impersonation. impersonationUser: foo?”

January 22, 2019 By Gene Pang

Impersonation is simply the ability for one user to act on behalf of another user. For example, say user ‘yarn’ has the credentials to connect to a service, but user ‘foo’ does not. On its own, user ‘foo’ would never be able to access the service. However, user ‘yarn’ can access the service and impersonate (act on behalf of) user ‘foo’, which gives user ‘foo’ access through ‘yarn’.
The impersonation feature defines which users can act on behalf of which other users. Therefore, it is important to know who the users are.
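
The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Alluxio's implementation: the rule table, the function name `resolve_effective_user`, and the `ImpersonationError` class are hypothetical, and the first error message simply mirrors the one quoted in the post title.

```python
# Minimal sketch of an impersonation check (hypothetical names throughout).
# Rule table: connecting user -> set of users it may act on behalf of.
# "*" is used here to mean "any user"; this convention is an assumption.
IMPERSONATION_RULES = {"yarn": {"foo"}}


class ImpersonationError(Exception):
    """Raised when a connecting user may not act on behalf of another user."""


def resolve_effective_user(connecting_user, impersonation_user=None,
                           rules=IMPERSONATION_RULES):
    """Return the user whose identity the request should run as."""
    # No impersonation requested: the request runs as the connecting user.
    if impersonation_user is None or impersonation_user == connecting_user:
        return connecting_user

    allowed = rules.get(connecting_user)
    if not allowed:
        # The connecting user has no impersonation rules at all.
        raise ImpersonationError(
            f"User {connecting_user} is not configured for any impersonation. "
            f"impersonationUser: {impersonation_user}"
        )
    if "*" in allowed or impersonation_user in allowed:
        return impersonation_user
    # Rules exist, but they do not cover the requested user.
    raise ImpersonationError(
        f"User {connecting_user} is not configured to impersonate "
        f"{impersonation_user}"
    )
```

With the rule table above, `resolve_effective_user("yarn", "foo")` returns `"foo"`, while the same call with an empty rule table raises the error from the title. In other words, that error points at a missing rule for the connecting user rather than at a problem with user ‘foo’.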

Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com’s Computation Frameworks

September 14, 2018 by Bing Bai & Tao Huang [JD.com]

Strata NY 2018 – Learn how to use Alluxio as a pluggable optimization component. Understand how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing while ensuring consistency between Alluxio and HDFS.

Tags: apache hadoop, benchmark, case study, compute storage separation, hdfs, presto

Tencent Case Study: Delivering Customized News to Over 100 Million Users per Month with Alluxio

April 8, 2018 By Can He (Tencent)

Tencent is one of the largest technology companies in the world and a leader in multiple sectors such as social networking, gaming, e-commerce, mobile and web portal. Tencent News, one of Tencent’s many offerings, strives to create a rich, timely news application to provide users with an efficient, high-quality reading experience. To provide the best experience to more than 100 million monthly active users of Tencent News, we leverage Alluxio with Apache Spark to create a scalable, robust, and performant architecture.

MOMO: Accelerating Ad Hoc Analysis with Spark SQL and Alluxio

March 20, 2018 By MOMO Team

Alluxio clusters act as a data access accelerator for remote data in connected storage systems. Temporarily storing data in memory, or other media near compute, accelerates access and provides local performance from remote storage. This capability is even more critical with the movement of compute applications to the cloud and data being located in object stores separate from compute. Caching is transparent to users, using read/write buffering to maintain continuity with persistent storage. Intelligent cache management utilizes configurable policies for efficient data placement and supports tiered storage for both memory and disk (SSD/HDD).
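
The read-path behavior described above can be illustrated with a small read-through cache. This is a sketch under stated assumptions, not Alluxio code: the class `ReadThroughCache`, its `fetch_remote` callback, and the choice of least-recently-used eviction are all hypothetical stand-ins for the configurable policies the excerpt mentions.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustrative only)."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote  # callable: path -> data (slow, remote)
        self._capacity = capacity          # maximum number of cached entries
        self._entries = OrderedDict()      # path -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, path):
        # Cache hit: serve locally and mark the entry as most recently used.
        if path in self._entries:
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]

        # Cache miss: fetch from the remote store, then keep a local copy.
        data = self._fetch_remote(path)
        self.misses += 1
        self._entries[path] = data

        # Evict the least recently used entry once capacity is exceeded.
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)
        return data
```

The caller only ever invokes `read`, so the caching stays transparent: repeated reads of the same path are served locally, and the remote store is contacted again only after an entry has been evicted.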
