Introducing Wormhole: Dockerized Presto & Alluxio setups for blazing fast analytics

This is a guest blog by Ashwin Sinha, adapted from the original blog source.

This blog introduces Wormhole, an open source, Dockerized solution for deploying Presto and Alluxio clusters for blazing fast analytics on a file system (we use S3, GCS, or OSS). When it comes to analytics, people are generally hands-on with SQL and love to analyse data that resides in a warehouse (e.g. a MySQL database). But as data grows, these stores start to struggle, and the need arises to get results in the same or a shorter time frame. Distributed computing solves this, and Presto is designed for exactly that. Paired with Alluxio, it works even faster. That's what Wormhole is all about.

Here is the high-level architecture diagram of the solution:

Let us explain each component in the order in which it should be set up in Wormhole:

  • Consul — Consul is a service networking solution to connect and secure services across any runtime platform and public or private cloud. In Wormhole it lets containers identify one another by container ID. Setup instructions here.
  • Docker — Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. In our setup, every service runs as its own Docker container (a minimal launch-and-register sketch follows this list). Setup instructions here.
  • Alluxio Master — Alluxio masters should be deployed in High Availability (HA). They manage the Alluxio file system metadata (files, blocks and worker status) and coordinate the Alluxio workers that serve the cached data. Setup instructions here.
  • Alluxio Worker — Alluxio workers are the actual cache storage of Alluxio. When data is read through the Alluxio file system (FS), a worker fetches it from the underlying FS (S3, GCS or any other configured store) and keeps it in RAM, evicting blocks in LRU fashion. On subsequent reads it serves the blocks from its own cache instead of going back to the underlying FS. Setup instructions here.
  • Hive metastore — The Hive metastore holds the metadata of all Hive tables. In our setup, the metastore is backed by MySQL and stores the schemas and locations of tables whose data lives in Alluxio. Setup instructions here.
  • Presto Coordinator — Presto coordinators should be deployed in High Availability (HA). They are responsible for parsing each query, building the query execution plan, distributing work to the Presto workers, and then merging the individual results and returning them to the requester (a query sketch follows this list). Setup instructions here.
  • Presto Worker — Presto workers are responsible for crunching the data read through Alluxio. They pull the data, perform operations on it, and send their partial results back to the coordinator. Setup instructions here.
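
To make the Docker and Consul pieces concrete, below is a minimal sketch of how one of the service containers (here, an Alluxio worker) could be launched and registered for discovery, using the Docker SDK for Python and python-consul. The image tag, network name, hostnames, ports and health-check interval are assumptions for illustration; in the real setup these steps are driven by Wormhole's own scripts.

```python
import docker
import consul

# Launch an Alluxio worker as a Docker container (image tag, network name
# and environment are illustrative; use the values from the setup docs).
docker_client = docker.from_env()
container = docker_client.containers.run(
    "alluxio/alluxio:2.4.1",
    "worker",
    name="alluxio-worker-1",
    hostname="alluxio-worker-1",
    network="wormhole",
    environment={"ALLUXIO_MASTER_HOSTNAME": "alluxio-master"},
    detach=True,
)

# Register the new container with Consul so other containers can discover
# it; the service is keyed by the (short) container ID.
consul_client = consul.Consul(host="127.0.0.1", port=8500)
consul_client.agent.service.register(
    name="alluxio-worker",
    service_id=container.id[:12],          # short container ID
    address="alluxio-worker-1",
    port=29999,                            # worker RPC port (assumed)
    check=consul.Check.tcp("alluxio-worker-1", 29999, "10s"),
)
```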
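
A second sketch shows how the Hive metastore, Alluxio and Presto fit together on the read path, using the presto-python-client package (imported as prestodb). It registers an external table whose files sit behind Alluxio and runs a query against it; the hostnames, ports, table schema and alluxio:// path are assumptions for illustration.

```python
import prestodb

# Connect to a Presto coordinator (hostname, port and user are assumed).
conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="wormhole",
    catalog="hive",      # Hive connector backed by the MySQL metastore
    schema="default",
)
cur = conn.cursor()

# Register an external table whose data lives behind Alluxio. The Hive
# metastore stores only the schema and location; Presto workers read the
# actual blocks through the Alluxio workers' cache.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        event_time TIMESTAMP,
        payload VARCHAR
    )
    WITH (
        external_location = 'alluxio://alluxio-master:19998/data/events',
        format = 'PARQUET'
    )
""")
cur.fetchall()  # consume the (empty) result so the DDL statement completes

# The first scan is pulled from the under-store (S3/GCS/OSS); repeated
# scans are served from the Alluxio workers' RAM cache.
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())
```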

Apart from the above components, we need a Zookeeper quorum, which makes the Alluxio masters and Presto coordinators highly available (HA). For complete documentation on the setup, please refer here.
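
As a quick sanity check that the quorum is reachable before wiring up HA, a few lines with the kazoo client are enough; the hostnames below are placeholders for your own Zookeeper nodes.

```python
from kazoo.client import KazooClient

# Connect to the Zookeeper quorum backing Alluxio master and Presto
# coordinator HA (hostnames are placeholders).
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start(timeout=10)

print("Connected to Zookeeper:", zk.connected)
print("Top-level znodes:", zk.get_children("/"))

zk.stop()
zk.close()
```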

Time for some action

Now that we have set up Presto on top of Alluxio, how do we make it available for everyone to use? One answer is a tool like Metabase, which provides connectivity to Presto: we just needed to add the appropriate connection configuration and it works like a charm for all sorts of analysis.
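
The same connection details Metabase asks for (host, port, catalog, schema) are all that a script or notebook needs as well. Here is a minimal sketch with PyHive's Presto client, where the hostname and user are assumptions:

```python
from pyhive import presto

# Connect with the same details a BI tool would use (values are assumed).
conn = presto.connect(
    host="presto-coordinator",
    port=8080,
    username="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```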

Presto and Alluxio also provide UIs to track the current state of the cluster, which helps a lot.

Alluxio monitor UI
Presto monitor UI
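
For dashboards or alerting that need the same numbers programmatically, the Presto coordinator also exposes cluster statistics as JSON over its REST API; a hedged sketch is below (the /v1/cluster path and field names come from Presto's REST API and may differ between versions).

```python
import requests

# Fetch basic cluster stats from the Presto coordinator's REST endpoint
# (hostname and port are assumptions).
resp = requests.get("http://presto-coordinator:8080/v1/cluster", timeout=5)
resp.raise_for_status()
stats = resp.json()

print("Active workers :", stats.get("activeWorkers"))
print("Running queries:", stats.get("runningQueries"))
print("Queued queries :", stats.get("queuedQueries"))
```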

Next Plans

The next focus will mostly be on making the solution self-serve through a user interface and making it self-scalable (possibly by deploying on Kubernetes).

ABOUT THE AUTHOR

An engineer at heart and a Data Engineer by profession, who loves to learn and play with distributed systems. An open-source enthusiast who loves to contribute back to the community; his work can be found on GitHub and Medium. Also a traveller who loves to explore distant places and enjoys long bike rides.