spark Archives

Burst Presto & Spark workloads to AWS EMR with no data copies

Community Online Office Hour * April 28, 2020

In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.

Bursting Apache Spark Workloads to the Cloud on Remote Data

Community Online Office Hour * March 10, 2020

Accessing data to run analytic workloads in Spark across data centers and/or clouds can be challenging. Additionally, network I/O can bottleneck Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer, manually copy the input data into HDFS first, and purge it afterward. But this ETL process is both time-consuming and error-prone.

CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes Using Alluxio

CNCF Webinar * January 21, 2020

In the on-prem days, one key performance optimization for Apache Hadoop or Apache Spark workloads was to run tasks on nodes with local HDFS data. However, while adoption of the cloud and Kubernetes makes scaling compute workloads exceptionally easy, HDFS is often not an option. Efficiently accessing data from cloud-native storage services like AWS S3, or even from on-premises HDFS, becomes harder as data locality is lost.

Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

December 19, 2019

This office hour describes the concepts and dataflow of running the Spark/Alluxio stack in Kubernetes with enhanced data locality, even when the storage service is external or remote.

Tags: data locality, hdfs, kubernetes, office hour, spark

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

Alluxio Community Office Hour * December 17, 2019

One important performance optimization in Apache Spark is to schedule tasks on nodes where a local HDFS data node serves the task's input data. However, more users are running Apache Spark natively on Kubernetes, where HDFS is not an option. This office hour describes the concepts and dataflow of running the Spark/Alluxio stack in Kubernetes with enhanced data locality, even when the storage service is external or remote.

Community Office Hour: Improving Memory Utilization of Spark Jobs Using Alluxio

November 26, 2019

Many Spark users may not be aware of the differences in memory utilization between caching data directly in memory inside the Spark JVM and storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open the floor for discussion.

Tags: caching, memory, office hour, spark

Improving Memory Utilization of Spark Jobs Using Alluxio

Alluxio Community Office Hour * November 26, 2019

This office hour shares a demo and compares two approaches: caching data directly in memory inside the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio.

Tag: spark