Many Spark users may not be aware of the differences in memory utilization between caching data directly in-memory into the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up for discussions
Alluxio, an open source data orchestration technology, helping speed up Dataproc workloads by providing a distributed caching layer in the Dataproc Cluster.
This office hour shares a demo and compares two approaches, caching data directly in-memory into the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio
Learn how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. See how to run Dataproc Spark against a remote HDFS cluster.
ODSC WEST 2019 Cloud storage brings great flexibility in management and cost-efficiency to data scientists, but also introduces new challenges related to data accessibility and data locality for machine learning applications. For instance, when the input data is stored in a remote cloud storage like AWS S3 or Azure blob storage, direct data access is … Continued
In the previous tutorial ”Getting Started with Spark Caching using Alluxio in 5 Minutes”, we demonstrated how to get started with Spark and Alluxio. To share more thoughts and experiments on how Alluxio enhances Spark workloads, this article focuses on how Alluxio helps to optimize the memory utilization of Spark applications. For users who are … Continued
Learn why leading companies are moving towards a decoupled compute and storage architecture, and the associated challenges and requirements. Hear about how Spark and Alluxio together can solve the challenges.
Running Spark with Alluxio is a popular stack particularly for hybrid environments. In this session, Dipti will briefly introduce Alluxio, share the top 10 tips for performance tuning for real-world workloads, and demo Alluxio with Spark.
In this talk, we present: trends and challenges in the data ecosystem in cloud era; Data engineering in the cloud with data orchestration; Use cases of using tech stacks (Presto or Tensorflow) with Alluxio on S3.