Getting Started with EMR Hive on Alluxio in 10 Minutes
This tutorial describes steps to set up an EMR cluster with Alluxio as a distributed caching layer for Hive, and run sample queries to access data in S3 through Alluxio.
This tutorial describes steps to set up an EMR cluster with Alluxio as a distributed caching layer for Hive, and run sample queries to access data in S3 through Alluxio.
This article describes my lessons from a previous project which moved a data pipeline originally running on a Hadoop cluster managed by my team, to AWS using EMR and S3. The goal was to leverage the elasticity of EMR to offload the operational work, as well as make S3 a data lake where different teams can easily share data across projects.
This article describes how JD built this interactive OLAP platform combining two open-source technologies: Presto and Alluxio.
In this article, you will learn how to incorporate Alluxio to implement a unified distributed file system service as well as how to add extensions on top of Alluxio including customized authentication schemes and UDF (user-defined functions) on Alluxio files.
I recently worked on a PoC evaluating Nomad for a client. Since there were certain constraints limiting what was possible on the client environment, I put together something “quick” on my personal workstation to see what was required for Alluxio to play nice with Nomad.
Alluxio is a new layer on top of under storage systems that can not only improve raw I/O performance but also enables applications flexible options to read, write and manage files. This article focuses on describing different ways to write files to Alluxio, realizing the tradeoffs in performance, consistency, and also the level of fault tolerance compared to HDFS.
Grafana, a comprehensive metrics visualization software, ties into this process by pulling the metrics that systems like Alluxio collect through a sink and visualizes them in a more helpful fashion. This guide will cover how to set up Grafana and Graphite, a supported sink for Alluxio that will put metrics in a time-series database, along with exploring some of the possibilities that the combination offers.
Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write to allow users to complete writes to Alluxio file system and return quickly with high application performance, while still providing users with peace of mind that data will be persisted to the chosen under storage like S3 in the background.
Alluxio is a proud sponsor and exhibitor at the AWS Summit in New York. If you weren’t able to attend, here are the highlights