hdfs Archives | Page 3 of 10

Burst Presto & Spark workloads to AWS EMR with no data copies

Community Online Office Hour * April 28, 2020

In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.

Testing Distributed System at Scale for the Cost of a Large Pizza on AWS

Community Online Office Hour * February 25, 2020

Building distributed systems is no small feat. Software testing is just one of many critical practices that engineers who build these systems need to utilize to ensure the quality and usability of their software. For distributed systems, scaling out testing frameworks to ensure that enterprises who run our in highly distributed environments is a complicated (and expensive task!)

Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

December 19, 2019

This office hour describes the concept and dataflow with respect to using the stack of Spark/Alluxio in Kubernetes with enhanced data locality even the storage service is outside or remote.

Tags: data locality, hdfs, kubernetes, office hour, spark

Speeding Up I/O for Machine Learning

Alluxio Global Online Meetup * January 15, 2020

This talk will guide the audience on how Alluxio can greatly simplify the data preparation phase in with remote and possibly multiple data sources. We will share the lessons and benchmark from Bill Zhao an engineer led in Apple when building a Machine Learning platform using Tensorflow, NFS, DC/OS and Alluxio.

Tech Talk: Integrating Google Cloud Dataproc with Alluxio for faster performance in the cloud

December 12, 2019

Google Cloud Dataproc is a widely used fully managed Spark and Hadoop service to run big data analytics and compute workloads in the cloud. Services like Dataproc reduce hardware spend, eliminate the need to overbuy capacity, and provide business agility. Yet users still face challenges for performance sensitive workloads or workloads running on remote data.

Alluxio is an open source cloud data orchestration platform that increases performance of analytic workloads running on Dataproc by intelligently caching data and bringing back lost data locality. Alluxio also enables users to run compute workloads against on-prem storage like Hadoop HDFS without any app changes.

Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They also show how to run Dataproc Spark against a remote HDFS cluster.

Tags: dataproc, google cloud, hdfs, tech talk

Tech Talk: The Path to Migrating off MapR

December 11, 2019

If you’re a MapR user, you might have concerns with your existing data stack. Whether it’s the complexity of Hadoop, financial instability and no future MapR product roadmap, or no flexibility when it comes to co-locating storage and compute, MapR may no longer be working for you.

Alluxio can help you migrate to a modern, disaggregated data stack using any object store with the similar performance of Hadoop plus significant cost savings.

Join us for this tech talk where we’ll discuss how to separate your compute and storage on-prem and architect a new data stack that makes your object store the core. We’ll show you how to offload your MapR/HDFS compute to any object store and how to run all of your existing jobs as-is on Alluxio + object store.

Tags: hdfs, mapr, object stores, tech talk

Online Meetup: Powering Data Science and AI with Apache Spark, Alluxio, and IBM

October 29, 2019

Learn why leading companies are moving towards a decoupled compute and storage architecture, and the associated challenges and requirements. Hear about how Spark and Alluxio together can solve the challenges.

Tags: analytics, compute storage separation, hdfs, meetup, performance, spark, use case

Tag: hdfs