Resource Hub

Presentation

Online Meetup: AWS S3 + Alluxio + Presto = ❤️ The Ryte Use Case

At Ryte, we analyze unstructured, semi-structured, and structured data for more than one million users worldwide. The whole Ryte-Platform is built on a scalable architecture to support our heavy load and to let our customers drill down from a high-level overview to the last byte of their websites.

In this presentation, I will show why and how we solve some challenging technical issues, improve the speed, and reduce the costs of our AWS EMR Hadoop and Presto backend with Alluxio.

Topics:

What is Ryte: Platform to optimize your Online-Marketing
Requirements for the Ryte-Platform
Why we use Presto on AWS EMR with S3
When problems pop up
How we solve them with Alluxio in a perfect way

Blog

Q&A with Alluxio's Bin Fan on Data Orchestration, Cloud Migration, and Data Engineering Challenges

For today’s blog post I interviewed Bin Fan, Founding Engineer and VP of Open Source at Alluxio. Bin is the PMC maintainer of the Alluxio open source project. Prior to Alluxio, he worked for Google on the next-generation storage infrastructure.

Presentation

Alluxio – Data Orchestration for Analytics and AI in the Cloud

Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More

Data storage is migrating from the colocated model (e.g., HDFS) to a more cost-effective and scalable, but often fully disaggregated and remote, data lake model (e.g., S3). This has created a strong need for data orchestration in the cloud, much like what K8s does for container-based workloads, so that data can be presented in the right layout at the right location for data applications in the cloud. Originally developed from the UC Berkeley AMPLab project “Tachyon”, Alluxio (www.alluxio.io) implements the world’s first open-source data orchestration system in the cloud: a unified access layer for data-driven applications in big data and ML that enables Spark, Presto, or TensorFlow to transparently access different external storage systems while actively leveraging an in-memory cache to accelerate data access. In this talk, we will present trends and challenges in the data ecosystem in the cloud era; data engineering in the cloud with data orchestration; and use cases of tech stacks (Presto or TensorFlow) with Alluxio on S3.

Blog

Getting Started with EMR Hive on Alluxio in 10 Minutes

This tutorial describes steps to set up an EMR cluster with Alluxio as a distributed caching layer for Hive, and run sample queries to access data in S3 through Alluxio.

On Demand Videos

Community Office Hour: Accelerating Hive with Alluxio on S3

ALLUXIO COMMUNITY OFFICE HOUR

On Demand Videos

Data Orchestration for AI, Big Data, and Cloud

IFA+ SUMMIT 2019

On Demand Videos

Online Meetup: Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3

Blog

Effective Analytical Pipelines on AWS Using EMR, Alluxio, and S3

This article describes my lessons from a previous project that moved a data pipeline, originally running on a Hadoop cluster managed by my team, to AWS using EMR and S3. The goal was to leverage the elasticity of EMR to offload the operational work, as well as to make S3 a data lake where different teams can easily share data across projects.

Blog

Building a Large-scale Interactive SQL Query Engine Using Presto and Alluxio at JD.com

This article describes how JD built this interactive OLAP platform combining two open-source technologies: Presto and Alluxio.

White Paper

Why Data Orchestration?

Blog

Implementing a Secure Plug-and-play Distributed File System Service Using Alluxio in Baidu

In this article, you will learn how to incorporate Alluxio to implement a unified distributed file system service as well as how to add extensions on top of Alluxio including customized authentication schemes and UDF (user-defined functions) on Alluxio files.

On Demand Videos

Tech Talk: Accelerating analytics with EMR on your S3 data lake

Presentation

360 & Alluxio Joint Meetup: Distributed Storage and Alluxio Application

360 & ALLUXIO JOINT MEETUP

Using Alluxio POSIX (FUSE) API in JD.com

Alluxio FUSE landing in Jingdong
Deep analysis of Alluxio FUSE principle and architecture
How to improve POSIX compatibility of Alluxio FUSE
JD’s contribution to the Alluxio community

On Demand Videos

Community Office Hour: Building a Cloud Native Stack with EMR Spark, Alluxio, and S3

ALLUXIO COMMUNITY OFFICE HOUR

Presentation

Bay Area Meetup: Interactive Analytics in the Cloud with Presto and Alluxio

ALLUXIO BAY AREA MEETUP

This talk describes a stack that combines Presto, Alluxio, and cloud object storage systems (e.g., AWS S3) for highly concurrent, low-latency SQL queries on big data in the cloud. Presto, an open-source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source data orchestration system that brings data closer to compute and provides a unified data access layer at in-memory speeds. Presto can use Alluxio as a distributed caching tier on top of S3 for the hot data it queries, which avoids reading the same data repeatedly from the cloud.

This talk covers:

The architecture of Presto, its separation of compute and storage, its cloud-readiness, and recent advancements in the project such as the Cost-Based Optimizer and Kubernetes support.
An overview of Alluxio’s key concepts, architecture, and data flow.
Presto and Alluxio production use cases running on hundreds of nodes, including ING Bank, JD.com, and NetEase Games.
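
The caching-tier idea in this talk can be illustrated with a small, self-contained sketch. It is not Alluxio or Presto code: `ReadThroughCache` and `fetch_from_object_store` are hypothetical names, the object store is simulated by a plain function, and LRU eviction is an assumption made for the example. It only shows why repeated reads of hot data stop reaching remote storage once a cache sits in front of it.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy LRU read-through cache in front of a slow backing store."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch            # callable: key -> bytes (stands in for a remote read)
        self._capacity = capacity      # maximum number of cached objects
        self._entries = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self._fetch(key)                # only misses reach the backing store
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


remote_reads = []  # records every simulated trip to the object store


def fetch_from_object_store(key):
    # Hypothetical stand-in for reading an object from remote storage.
    remote_reads.append(key)
    return f"contents of {key}".encode()


cache = ReadThroughCache(fetch_from_object_store, capacity=2)
for key in ["a.parquet", "a.parquet", "b.parquet", "a.parquet"]:
    cache.read(key)

print(cache.hits, cache.misses, remote_reads)
```

In this run, four reads cause only two remote fetches; the repeated reads of `a.parquet` are served from the cache.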

Presentation

Austin Meetup: Efficient Data Engineering with Apache Spark, Hive, and Alluxio on S3

Cloud, Data, & Orchestration – Austin Meetup

At Bazaarvoice, a software-as-a-service digital marketing company, the data engineering team is tasked with handling data at massive Internet scale to serve over 1,900 of the biggest internet retailers and brands.

We built our data pipelines entirely in the cloud using Apache Spark and Hive on AWS EC2, accessing data in S3. AWS enables us to scale “out” the infrastructure capacity effortlessly to keep up with Internet-scale data and web traffic, but scaling out also exposes certain limitations, such as the ability to scale “up” further. While this cloud-native stack is scalable and elastic, we experience performance limitations because data access is limited by network bandwidth, and this is exacerbated for workloads that involve repeated queries.

To address the data access challenges, we leverage Alluxio, an open source data orchestration system for analytics in the cloud. Alluxio serves as a transparent caching layer for hot and warm data, such that Hive and Spark jobs are able to access all data transparently in S3. We have seen 10x performance acceleration of Spark and Hive jobs on S3 with Alluxio.

Blog

Four Different Ways to Write to Alluxio

Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options to read, write, and manage files. This article describes the different ways to write files to Alluxio and their tradeoffs in performance, consistency, and level of fault tolerance compared to HDFS.
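
To make the tradeoffs concrete, here is a minimal toy model of four write policies. The enum names follow the write types listed in the Alluxio documentation (MUST_CACHE, CACHE_THROUGH, THROUGH, ASYNC_THROUGH), but everything else is an assumption for illustration: `ToyFileSystem` is hypothetical, both storage tiers are plain dictionaries, and replication, failures, and eviction are ignored. It is a sketch of the idea, not the Alluxio client API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"        # write to the cache tier only
    CACHE_THROUGH = "CACHE_THROUGH"  # write to cache and under storage before returning
    THROUGH = "THROUGH"              # write to under storage only
    ASYNC_THROUGH = "ASYNC_THROUGH"  # write to cache now, persist to under storage later


class ToyFileSystem:
    """Hypothetical two-tier store used only to illustrate the policies."""

    def __init__(self):
        self.cache = {}          # stands in for the fast caching tier
        self.under_storage = {}  # stands in for the durable under storage (e.g. S3)
        self._pending = []       # paths waiting for a background persist

    def write(self, path, data, write_type):
        if write_type in (WriteType.MUST_CACHE, WriteType.CACHE_THROUGH,
                          WriteType.ASYNC_THROUGH):
            self.cache[path] = data
        if write_type in (WriteType.CACHE_THROUGH, WriteType.THROUGH):
            self.under_storage[path] = data
        if write_type is WriteType.ASYNC_THROUGH:
            self._pending.append(path)

    def persist_pending(self):
        """Simulate the background job that persists async writes."""
        for path in self._pending:
            self.under_storage[path] = self.cache[path]
        self._pending.clear()


fs = ToyFileSystem()
fs.write("/tmp/scratch", b"x", WriteType.MUST_CACHE)
fs.write("/data/table", b"y", WriteType.CACHE_THROUGH)
fs.write("/archive/log", b"z", WriteType.THROUGH)
fs.write("/data/result", b"w", WriteType.ASYNC_THROUGH)

durable_before = sorted(fs.under_storage)
fs.persist_pending()
durable_after = sorted(fs.under_storage)
print(durable_before, durable_after)
```

In this model, the cache-only write never becomes durable, the synchronous policies are durable as soon as the write returns, and the asynchronous policy is durable only after the background persist has run.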

On Demand Videos

Tech Talk: Accelerating Spark with Kubernetes

Blog

Creating Grafana Dashboards to Visualize Alluxio Metrics

Monitoring metrics is essential for operating distributed systems in production. Alluxio collects metrics on I/O throughput, RPC throughput, and resource usage using the Codahale Metrics Library. These metrics are shown in the Alluxio web UI, and they are also available through a REST endpoint or can be exported to several third-party sinks as time series (see docs).

On Demand Videos

Tech Talk: Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads

Blog

Accelerating Write-intensive Data Workloads on AWS S3

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write, which lets writes to the Alluxio file system complete and return quickly for high application performance, while still giving users peace of mind that data will be persisted in the background to the chosen under storage, such as S3.

On Demand Videos

Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark

ALLUXIO BAY AREA MEETUP

Blog

Recap AWS Summit New York

Alluxio is a proud sponsor and exhibitor at the AWS Summit in New York. If you weren't able to attend, here are the highlights.

Presentation

Scalable Filesystem Metadata Services with RocksDB

Alluxio maintainer and founding engineer Calvin Jia presents on Scalable Filesystem Metadata Services with RocksDB at the RocksDB meetup at Twitter.

Alluxio provides a unified namespace where you can mount multiple different storage systems and access them through the same API. To serve the file system requests to operate on all the files and directories in this namespace, Alluxio masters must handle the file system metadata at a scale of all mounted systems combined. We are writing several engineering blogs describing the design and implementation of Alluxio master to address this scalability challenge. This is the first article focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
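
As a rough illustration of encoding an inode tree in a key-value store, the sketch below keeps two kinds of entries: one mapping an inode ID to its metadata, and one mapping a (parent ID, child name) pair to the child's inode ID. This layout is an assumption chosen for the example, not a description of the actual Alluxio schema, and a Python dictionary stands in for RocksDB. `InodeStore` and its methods are hypothetical names.

```python
class InodeStore:
    """Toy inode tree stored as flat key-value pairs."""

    ROOT_ID = 0

    def __init__(self):
        self.kv = {}       # stands in for an embedded persistent key-value store
        self._next_id = 1
        self.kv[("inode", self.ROOT_ID)] = {"name": "", "is_dir": True}

    def create(self, parent_id, name, is_dir):
        inode_id = self._next_id
        self._next_id += 1
        # Inode entry: ID -> metadata.
        self.kv[("inode", inode_id)] = {"name": name, "is_dir": is_dir}
        # Edge entry: (parent ID, child name) -> child ID.
        self.kv[("edge", parent_id, name)] = inode_id
        return inode_id

    def lookup(self, path):
        """Resolve a path with one edge lookup per path component."""
        inode_id = self.ROOT_ID
        for part in path.strip("/").split("/"):
            if not part:
                continue
            inode_id = self.kv.get(("edge", inode_id, part))
            if inode_id is None:
                return None
        return inode_id

    def list_children(self, parent_id):
        # A sorted key-value store could answer this with a prefix scan;
        # the dictionary stand-in has to filter every key instead.
        return sorted(key[2] for key in self.kv
                      if key[0] == "edge" and key[1] == parent_id)


store = InodeStore()
data_id = store.create(InodeStore.ROOT_ID, "data", is_dir=True)
file_id = store.create(data_id, "part-0.parquet", is_dir=False)
store.create(data_id, "part-1.parquet", is_dir=False)

print(store.lookup("/data/part-0.parquet") == file_id)
print(store.list_children(data_id))
```

Keeping edges separate from inode metadata means a path lookup touches only small edge entries until it reaches the final component, which is one reason a flat key-value layout can serve tree-shaped metadata.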
