“Zero-Copy” Hybrid Bursting with no App Changes

This whitepaper details how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Starburst Presto, Spark, and Hive with Alluxio in the public cloud using on-prem HDFS.

The paper also includes a real world case study on a leading hedge fund based in New York City, who deployed large clusters of Google Compute Engine VMs with Spark and Alluxio using on-prem HDFS as the underlying storage tier.

Tags: , , , , ,

O’Reilly AI Conference Keynote: Data Orchestration for AI, Big Data, and Cloud

Haoyuan Li’s keynote at O’Reilly Beijing discusses open source data orchestration and the value of leveraging Alluxio with rising trends driving the need for a new architecture. Four big trends driving this need: Separation of compute & storage, hybrid-multi cloud environments, rise of object store and self-service data across the enterprise.

Tags: , , , , , , , , , , ,

Tech Talk: Accelerate Spark Workloads on S3

While running analytics workloads using EMR Spark on S3 is a common deployment today, many organizations face issues in performance and consistency. EMR can be bottlenecked when reading large amounts of data from S3, and sharing data across multiple stages of a pipeline can be difficult as S3 is eventually consistent for read-your-own-write scenarios.  

A simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, providing high performance, in addition to providing data accessibility and abstraction for deployments in both public and hybrid clouds.

Tags: , , , , , , , , ,

Community Office Hour: Running Spark & Alluxio in Kubernetes

The data orchestration layer bridging the gap between data locality with improved performance and data accessibility for analytics workloads in Kubernetes, and enables portability across storage providers.
An overview of Alluxio and the cloud use case with Spark in Kubernetes. Learn how to set up Alluxio and Spark to run in Kubernetes.

Tags: , , , , , , , , , , , ,

Alluxio at Beijing Meetup

Haoyuan Li presents at Beijing Meetup on open source data orchestration and the value of leveraging Alluxio with rising trends driving the need for a new architecture. Four big trends driving this need: Separation of compute & storage, hybrid-multi cloud environments, rise of object store and self-service data across the enterprise.

Tags: , , , , , , , , ,

Hybrid Environments for Data Analytics is a Possibility

As the data ecosystem becomes massively complex and more and more disaggregated, data analysts and end users have trouble adapting and working with hybrid environments. The proliferation of compute applications along with storage mediums leads to a hybrid model that we are just not accustomed to.
With this disaggregated system data engineers now come across a multitude of problems that they must overcome in order to get meaningful insights.

Effective Data Engineering in the Cloud World

Cloud has changed the dynamics of data engineering as well as the behavior of data engineers in many ways. This is primarily because a data engineer on premise only dealt with databases and some parts of the hadoop stack.
In the cloud, things are a bit different. Data engineers suddenly need to think different and broader. Instead of being purely focused on data infrastructure, you are now almost a full stack engineer (leaving out the final end application perhaps). Compute, containers, storage, data movement, performance, network — skills are increasing needed across the broader stack. Here are some design concept and data stack elements to keep in mind.

RocksDB Meetup at Twitter

Bay Area Meetup *

Twitter SF is hosting 2019’s half yearly RocksDB Meetup with speakers from Twitter, Facebook and the community on July 11th.