Hear how Bazaarvoice uses Apache Spark, Hive, and Alluxio on S3, and learn how to set up Hive with Alluxio so that Hive jobs can seamlessly read from and write to S3.
Learn how to set up EMR Spark with Alluxio so Spark jobs can seamlessly read from and write to S3. See a performance comparison between Spark on S3 and Spark with Alluxio on S3.
Learn how to set up Presto with Alluxio such that Presto jobs can seamlessly read from and write to S3.
Compare the performance of Presto on S3 with that of Presto with Alluxio on S3.
The latest advances in container orchestration by Kubernetes bring cost savings and flexibility to compute workloads in public or hybrid cloud environments. On the other hand, they introduce new challenges, such as how to move data to compute efficiently, how to unify data across multiple or remote clouds, and how to co-locate data with compute. Alluxio approaches these problems in a new way. It helps elastic compute workloads realize the true benefits of the cloud, while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.
Many organizations use EMR to run big data analytics on the public cloud. However, reading from and writing to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches S3 data, providing high and predictable performance as well as reduced network traffic.
Alluxio can help data scientists and data engineers interact with different storage systems in a hybrid cloud environment. Using Alluxio as a data access layer for Big Data and Machine Learning applications, data processing pipelines can improve efficiency without explicit data ETL steps and the resulting data duplication across storage systems.
Join us for our first monthly office hour. This month we will focus on:
Installing Alluxio using Docker and Homebrew on your local Linux/Mac machine and accessing data from S3 and HDFS; understanding Alluxio’s architecture in the data ecosystem; and an open session for discussion on any topics, such as solving the separation of compute and storage problem, unifying multiple storage systems, and more.
In this Office Hour you’ll learn about:
Using Alluxio as the input/output for Spark applications; saving and loading Spark RDDs and DataFrames with Alluxio; and an open session for discussion on any topics, such as solving the separation of compute and storage problem, unifying multiple storage systems, and more.
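As a rough illustration of what saving and loading Spark data through Alluxio can look like, here is a minimal PySpark sketch. It is not taken from the session itself. The helper names `alluxio_uri` and `round_trip`, the host `localhost`, and the output directory names are assumptions made for illustration. Port 19998 is the default Alluxio master RPC port, and the sketch assumes the Alluxio client jar is already on Spark's classpath.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI (hypothetical helper; host is a placeholder)."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def round_trip(spark, base):
    """Save and reload a DataFrame and an RDD under `base` (hypothetical helper).

    `spark` is an existing SparkSession; `base` is an alluxio:// URI.
    """
    # DataFrame: write Parquet to Alluxio, then read it back.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.write.mode("overwrite").parquet(base + "/df_parquet")
    loaded_df = spark.read.parquet(base + "/df_parquet")

    # RDD: saveAsTextFile fails if the target directory already exists.
    sc = spark.sparkContext
    sc.parallelize(["x", "y", "z"]).saveAsTextFile(base + "/rdd_text")
    loaded_rdd = sc.textFile(base + "/rdd_text")

    return loaded_df.count(), loaded_rdd.count()
```

With a running Alluxio cluster and a `SparkSession` named `spark`, a call such as `round_trip(spark, alluxio_uri("/tmp/spark-io-sketch"))` would exercise both paths. The only Alluxio-specific part is the URI scheme, so existing Spark code typically changes only in the paths it reads and writes.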
The Alluxio POSIX API enables data engineers to access any distributed file system or cloud storage as if accessing a local file system with an added performance improvement. This reduces the effort and complexity for data engineers to run their machine learning or legacy workloads on new data storage without data migration or data duplication.
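To make the "as if accessing a local file system" point concrete, the sketch below uses only ordinary Python file I/O. It assumes Alluxio has already been mounted through its POSIX (FUSE) interface at some directory. The mount point is passed in as `mount_root`, and the function names are hypothetical. Nothing in the code is Alluxio-specific, which is the point: the same calls work on any local directory.

```python
from pathlib import Path


def write_and_read(mount_root, rel_path, text):
    """Write text under the mount point and read it back with plain file I/O."""
    target = Path(mount_root) / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target.read_text(encoding="utf-8")


def list_files(mount_root):
    """Return all file paths under the mount point, relative to it, sorted."""
    root = Path(mount_root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

An existing training script or legacy tool would be pointed at the mount directory in the same way, for example `write_and_read("/mnt/alluxio-fuse", "models/notes.txt", "hello")`, where `/mnt/alluxio-fuse` is a placeholder for wherever the mount actually lives.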