data Archives

Getting Started with the Alluxio-Presto Sandbox

July 11, 2019 By Zac Blanco

The Alluxio-Presto sandbox is a Docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox gives you an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

Accelerating Analytical Workloads for Public & Hybrid Clouds

New York Meetup * July 10, 2019

In this meetup, Dipti and HY will present a new approach to hybrid analytical workloads using Alluxio, an open source data orchestration layer that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance, thanks to the data locality and abstraction that the data orchestration tier provides.

Turn cloud storage or HDFS into your local file system for faster AI model training with TensorFlow

July 3, 2019 By Lu Qiu and Bin Fan

This article presents a different approach to making distributed file systems like HDFS or cloud storage systems look like a local file system to data processing frameworks: the Alluxio POSIX API. To explain the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.

O’Reilly AI Conference Keynote: Data Orchestration for AI, Big Data, and Cloud

June 28, 2019

Haoyuan Li’s keynote at O’Reilly Beijing discusses open source data orchestration and the value of leveraging Alluxio as rising trends drive the need for a new architecture. Four big trends drive this need: separation of compute and storage, hybrid and multi-cloud environments, the rise of object stores, and self-service data across the enterprise.

Tags: big data, cloud, cloud object storage, cloud storage, compute storage separation, conference, data, data orchestration, hybrid cloud, multi cloud, on-prem object storage, storage

Tech Talk: Accelerate Spark Workloads on S3

June 28, 2019

While running analytics workloads using EMR Spark on S3 is a common deployment today, many organizations face issues in performance and consistency. EMR can be bottlenecked when reading large amounts of data from S3, and sharing data across multiple stages of a pipeline can be difficult as S3 is eventually consistent for read-your-own-write scenarios.

A simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, providing high performance as well as data accessibility and abstraction for deployments in both public and hybrid clouds.

Tags: aws, cloud, compute storage separation, data, data orchestration, emr, hybrid cloud, on-prem object storage, spark, tech talk

Community Office Hour: Running Spark & Alluxio in Kubernetes

June 25, 2019 by Bin Fan & Adit Madan

A data orchestration layer bridges the gap between data locality, with its improved performance, and data accessibility for analytics workloads in Kubernetes, and enables portability across storage providers.
This office hour gives an overview of Alluxio and the cloud use case with Spark in Kubernetes, and shows how to set up Alluxio and Spark to run in Kubernetes.

Tags: analytics, apache spark, compute, compute storage separation, data, data orchestration, hybrid cloud, kubernetes, locality, multi cloud, office hour, spark, storage

Building Fast SQL Analytics with Presto, Alluxio, and S3

Alluxio Community Office Hour * July 30, 2019

Learn how to set up Presto with Alluxio so that Presto jobs can seamlessly read from and write to S3.
Compare the performance of Presto on S3 with that of Presto with Alluxio on S3.

Alluxio at Beijing Meetup

June 25, 2019

Haoyuan Li presents at the Beijing Meetup on open source data orchestration and the value of leveraging Alluxio as rising trends drive the need for a new architecture. Four big trends drive this need: separation of compute and storage, hybrid and multi-cloud environments, the rise of object stores, and self-service data across the enterprise.

Tags: big data, cloud, cloud storage, compute storage separation, data, data orchestration, hybrid cloud, meetup, multi cloud, storage

Building fast and scalable big data and ML platforms at Pinterest and JD.com

June 21, 2019 by Calvin Jia & Yongsheng Wu [Pinterest]

This talk shares our design, implementation, and optimization of the Alluxio metadata service to address scalability challenges. It focuses on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and the exploration and performance tuning of RPC frameworks (Thrift vs. gRPC).

Tags: aws s3, data, machine learning, meetup, metadata management, performance, scale, tiered storage
