Alluxio 2.4.0 focuses on features critical to large scale, production deployments in Cloud and Hybrid Cloud environments. Features such as highly scalable metadata journaling, aggregate cluster metrics monitoring, and automated detection of JVM pauses further improve Alluxio’s suitability for demanding workloads.
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
Alluxio 2.3 was just released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and on-going community feature development are welcome.
In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.
This is a casual online video chat where all attendees are welcome to bring your own questions. Our host Bin will have suggested topics, such as the top challenges around leveraging popular compute frameworks including Presto and Spark to access remote data, and the latest developments in Alluxio open source such as Alluxio Catalog Services.
In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.
It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed. This talk shares our design, implementation, and optimization of Alluxio metadata service (master node) to address the scalability challenges.
Accessing data to run analytic workloads in Spark across data centers and/or clouds can be challenging. Additionally, network I/O can bottleneck Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer and manually copy the input data to HDFS first, purging it afterward. But this ETL process can be both time-consuming and also error-prone.
Alluxio, as a data orchestration layer provides the physical data independence, for Presto to interact with the data more efficiently. In addition to caching for IO acceleration, Alluxio also provides a catalog service to abstract the metadata in the Hive Metastore, and transformations to expose the data in compute-optimized way. In this talk, we describe some of the challenges of using Presto with Hive, and introduce Alluxio data orchestration for solving those challenges.