We’re pleased to announce the general availability of Alluxio Data Orchestration Hub, your single pane of glass to orchestrate data for analytics and AI. The data ecosystem has grown complex as storage and compute are separated across data centers and cloud providers. This release simplifies data access and management across multiple environments.
Data Orchestration Hub, or the Hub, is a management console that makes it easy to manage an analytics cluster and connect it with multiple data sources to unify data lakes. The service provides an easy-to-use, unified management view for configuration and monitoring, along with wizard-based deployment workflows.
- Connect Your Data Sources: Connect Alluxio to data storage and catalogs across multiple clouds, a single cloud or on-premises using guided wizards.
- Monitor Your Alluxio Cluster: View a dashboard showing the state of the processes running on your Alluxio cluster.
- Manage Configuration: Set and distribute configuration for a cluster.
Alluxio Data Orchestration Hub is available immediately for all Alluxio deployment scenarios with compute engines like Presto, Spark and TensorFlow. The Hub is ready to use out of the box with Amazon EMR and Google Dataproc, and it can also be used on other platforms. See the documentation for more information on trying out the Hub.
When to Use
Connecting to data sources across regions
The Hub provides self-guided wizards to allow users to connect to data sources and catalogs in the same or remote data centers. A user is guided through the required configuration steps along with validation of the connection.
These wizards are applicable for multiple scenarios including: hybrid cloud, cross-data center, single cloud or private data center deployments. Manage your compute clusters with Alluxio using these easy-to-use wizards.
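For readers who want a feel for what such a wizard handles, here is a minimal sketch of the manual equivalent: assembling an `alluxio fs mount` command for an S3 bucket after a couple of basic checks. The helper `build_mount_command`, the bucket name, and the placeholder credentials are hypothetical and are not part of the Hub. The `s3a.accessKeyId` and `s3a.secretKey` option names are an assumption that should be checked against the Alluxio documentation for your version.

```python
import shlex


def build_mount_command(alluxio_path, ufs_uri, options=None):
    """Hypothetical helper: build the argument list for `alluxio fs mount`."""
    # Basic validation, loosely mirroring the checks a guided wizard might run.
    if not alluxio_path.startswith("/"):
        raise ValueError("Alluxio path must be absolute, e.g. /mnt/s3")
    if "://" not in ufs_uri:
        raise ValueError("storage URI needs a scheme such as s3:// or hdfs://")

    cmd = ["alluxio", "fs", "mount"]
    # Each mount option is passed as a separate --option key=value pair.
    for key, value in sorted((options or {}).items()):
        cmd += ["--option", f"{key}={value}"]
    cmd += [alluxio_path, ufs_uri]
    return cmd


# Placeholder values only; replace them with your own bucket and credentials.
command = build_mount_command(
    "/mnt/s3",
    "s3://example-bucket/data",
    {"s3a.accessKeyId": "<ACCESS_KEY_ID>", "s3a.secretKey": "<SECRET_KEY>"},
)
print(shlex.join(command))
```

The sketch only prints the command rather than running it, so it works without an Alluxio installation. The Hub's wizards take care of this kind of configuration, plus connection validation, from the console.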

Managing an Alluxio cluster
The Hub can be used to view a dashboard to monitor the state of processes on the cluster, as well as update configuration and restart processes. This is especially useful for cloud deployments without access to SSH for configuration and process management.

What’s Next
To start using Alluxio Data Orchestration Hub, simply launch Alluxio-enabled clusters in your on-premises or cloud deployment. Further changes to the cluster, and its monitoring, can then be managed using the Hub:
- Process Management: Monitor the status of each process in the Alluxio cluster, and start or stop processes.
- Connect Data Storage: Connect Alluxio to your data sources, such as HDFS, S3 or GCS, across a hybrid cloud, a single cloud or on-premises.
- Connect Data Catalog: Configure structured data catalogs for OLAP engines like Presto on Alluxio. Connect to existing catalog definitions to prevent re-definition of table metadata.
- Advanced Configuration: Customize your Alluxio cluster with advanced options for setting and distributing configuration from the central console.
For more information on Data Orchestration Hub and the supported toolset, please read the release notes.
Have questions? Come join the Community Slack Channel.
Read the Alluxio 2.4 release product blog to learn more about the expanded features and capabilities to advance analytics and AI in the cloud.
