How Can AI Platforms Adapt to Hybrid or Multi-Cloud Environments?

May 20, 2024

Hope Wang

This article was originally published on Spiceworks. https://www.spiceworks.com/tech/artificial-intelligence/guest-article/adapting-ai-platform-to-hybrid-cloud/

This post discusses the challenges of implementing AI platforms in hybrid and multi-cloud environments and shares examples of organizations that have prioritized security and optimized cost management using a data access layer.

In recent years, AI platforms have changed significantly as generative AI (GenAI) and AI more broadly continue to transform businesses. Traditionally, AI platforms relied on tightly coupled computation and storage: data and computation were co-located on the same infrastructure, a property known as data locality.

This approach worked well for small-scale AI projects, but such systems became hard to scale and manage efficiently. Modern data and AI platforms have therefore shifted to separating computation and storage for elasticity and scalability.

The migration of AI workloads to the cloud has been a significant trend in recent years. Cloud platforms offer AI services and tools, such as machine learning frameworks, pre-trained models, on-demand computing resources, and massive-scale object storage. These services enable organizations to quickly build and deploy AI applications without requiring extensive infrastructure investments.

As AI platforms scale further, the architecture must be extensible to the public or private cloud. As businesses expand their cloud footprint, they adopt multi-region, hybrid, and multi-cloud strategies to optimize performance, resilience, and cost. Multi-cloud has now become a strategic choice.

Why Hybrid or Multi-cloud?

Both technical and non-technical reasons drive the adoption of hybrid and multi-cloud strategies.

Hybrid and multi-cloud strategies allow organizations to leverage specialized services from different cloud providers and build robust AI solutions. These strategies help mitigate risks associated with service outages or pricing changes and support performance and cost-efficiency by matching workload requirements with the most suitable infrastructure. From a data locality perspective, placing business-critical applications near end users reduces latency. Furthermore, spreading AI workloads across multiple providers helps prevent vendor lock-in and increases organizations’ negotiating power with cloud providers.

Regulatory compliance and data sovereignty are critical non-technical factors influencing the adoption of hybrid and multi-cloud strategies. Hybrid architectures allow organizations to control sensitive data while leveraging the cloud’s benefits, ensuring compliance with data protection regulations like GDPR or HIPAA. Multi-cloud strategies enable compliance with data sovereignty laws and improve data locality for organizations operating in multiple regions.

Mergers and acquisitions between organizations that use different cloud service providers often prompt the adoption of a multi-cloud approach instead of an immediate migration of one organization’s setup into the other’s cloud. This approach provides a more flexible and cost-effective option for managing disparate cloud environments.

Challenges Leveraging Hybrid and Multi-cloud

Hybrid and multi-cloud strategies offer numerous advantages, such as increased flexibility, risk mitigation, and access to specialized services, but they also introduce new challenges.

One of the primary challenges in hybrid and multi-cloud environments is the latency introduced by remote data access. As AI workloads are distributed across different clouds and regions, data needs to be transferred between these locations, which can result in significant latency. This latency can impact the performance and responsiveness of AI applications, particularly those that require real-time processing or low-latency interactions.

Because remote access alone may not be sufficient for latency-sensitive workloads, another approach is to copy data among data centers, clouds, and regions. However, data movement and synchronization are complex and time-consuming, and network latency, data transfer costs, and data consistency issues can hinder the performance and efficiency of AI workflows. Managing costs across multiple cloud providers is also challenging because of their different pricing models and resource allocation mechanisms. Hidden expenses, such as data transfer fees and idle resources, can quickly escalate if not carefully monitored and optimized.
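To make the data transfer point concrete, here is a rough sketch that compares reading a dataset remotely on every training epoch with copying it once. The function name, dataset size, epoch count, and per-GB rate are all hypothetical values chosen for illustration and are not real provider pricing.

```python
def transfer_cost_usd(dataset_gb: float, remote_reads: int, egress_usd_per_gb: float) -> float:
    """Estimate egress spend: every remote read moves the full dataset across clouds."""
    return dataset_gb * remote_reads * egress_usd_per_gb


# Hypothetical inputs for illustration only; not real provider pricing.
DATASET_GB = 1000.0
EPOCHS = 10
EGRESS_USD_PER_GB = 0.05

# Reading remotely on every epoch transfers the dataset EPOCHS times.
read_every_epoch = transfer_cost_usd(DATASET_GB, EPOCHS, EGRESS_USD_PER_GB)
# Copying the dataset once transfers it a single time.
copy_once = transfer_cost_usd(DATASET_GB, 1, EGRESS_USD_PER_GB)

print(f"remote read every epoch: ${read_every_epoch:.2f}")
print(f"one-time copy:           ${copy_once:.2f}")
```

The model ignores storage for the copy, request fees, and the cost of keeping copies consistent, so treat it as a starting point for a fuller estimate.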

GPUs are critical accelerator technologies for AI workloads, providing the computational power needed for training and inference tasks. However, GPU time is expensive, and maximizing GPU utilization and reducing any wait time stemming from data access is essential. The challenge lies in continuously feeding GPUs with data to avoid idle computation.
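One common way to keep an accelerator fed is to load the next batches in the background while the current one is being processed. The sketch below illustrates that idea with only the Python standard library. `load_batch` and `train_step` are hypothetical stand-ins that sleep to simulate a remote read and GPU compute; they are not part of any real framework.

```python
import queue
import threading
import time


def load_batch(index: int) -> list:
    """Hypothetical stand-in for a slow remote read."""
    time.sleep(0.01)
    return [index] * 4


def train_step(batch: list) -> int:
    """Hypothetical stand-in for GPU compute on one batch."""
    time.sleep(0.01)
    return sum(batch)


def prefetching_loader(num_batches: int, depth: int = 2):
    """Yield batches while a background thread loads up to `depth` ahead."""
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def producer() -> None:
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks when the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def run(num_batches: int = 5) -> list:
    """Consume prefetched batches; loading overlaps with the training step."""
    return [train_step(batch) for batch in prefetching_loader(num_batches)]


results = run(5)
print(results)
```

The bounded queue caps memory use: the producer pauses once `depth` batches are waiting, and resumes as the consumer drains them.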

4 Best Practices for Hybrid or Multi-cloud AI Platforms

First, adopting cloud-agnostic architectures, such as containerization and serverless computing, can enhance portability and interoperability across different cloud environments. This approach decouples applications from the underlying infrastructure, enabling seamless migration and deployment across multiple clouds.

Second, deploying a data access layer between computation and storage provides a unified and efficient data access interface across multiple clouds and regions, minimizing data movement and optimizing data locality for improved performance.
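As a minimal sketch of the idea, the class below puts a single `read` interface in front of several storage backends and caches what it reads, so repeated access does not cross the cloud boundary again. The class name, the `cloud-a` and `cloud-b` URI schemes, and the in-memory stores are hypothetical illustrations and do not represent any specific product's API.

```python
from typing import Callable, Dict


class DataAccessLayer:
    """Route reads by URI scheme and cache the results (read-through cache)."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[str], bytes]] = {}
        self._cache: Dict[str, bytes] = {}
        self.remote_reads = 0  # counts reads that reached a backend

    def register(self, scheme: str, reader: Callable[[str], bytes]) -> None:
        """Attach a reader function for one storage system."""
        self._backends[scheme] = reader

    def read(self, uri: str) -> bytes:
        if uri in self._cache:  # cache hit: no remote access needed
            return self._cache[uri]
        scheme, _, path = uri.partition("://")
        if scheme not in self._backends:
            raise KeyError(f"no backend registered for scheme {scheme!r}")
        data = self._backends[scheme](path)
        self.remote_reads += 1
        self._cache[uri] = data
        return data


# Hypothetical in-memory stand-ins for object stores in two different clouds.
store_a = {"train/part-0": b"alpha"}
store_b = {"train/part-1": b"beta"}

layer = DataAccessLayer()
layer.register("cloud-a", store_a.__getitem__)
layer.register("cloud-b", store_b.__getitem__)

first = layer.read("cloud-a://train/part-0")
second = layer.read("cloud-a://train/part-0")  # served from the cache
other = layer.read("cloud-b://train/part-1")
print(layer.remote_reads)
```

A production layer would also need eviction, invalidation when the source data changes, and distribution across nodes, all of which this sketch leaves out.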

Third, organizations should implement a comprehensive security and compliance framework that accounts for each cloud provider’s unique requirements and policies. This may involve leveraging cloud-native security services, implementing encryption and access control mechanisms, and continuously monitoring and auditing for compliance violations.

Finally, monitoring resource utilization patterns and leveraging cloud-native tools can help automate resource scaling and optimize costs. Consider implementing multi-cloud cost management tools to gain visibility into and control over costs across different cloud providers.
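To illustrate what cross-provider visibility might look like in its simplest form, the sketch below totals spend per provider and flags resources with low utilization. The record fields, provider names, figures, and the 10% idle threshold are hypothetical. In practice the records would come from each provider's billing and monitoring exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UsageRecord:
    provider: str
    resource: str
    cost_usd: float
    utilization: float  # fraction of time in use, from 0.0 to 1.0


def summarize(
    records: List[UsageRecord], idle_threshold: float = 0.1
) -> Tuple[Dict[str, float], List[str]]:
    """Return total cost per provider and the resources below the idle threshold."""
    totals: Dict[str, float] = defaultdict(float)
    idle: List[str] = []
    for record in records:
        totals[record.provider] += record.cost_usd
        if record.utilization < idle_threshold:
            idle.append(record.resource)
    return dict(totals), idle


# Hypothetical usage data for illustration only.
records = [
    UsageRecord("cloud-a", "gpu-node-1", 120.0, 0.85),
    UsageRecord("cloud-a", "gpu-node-2", 30.0, 0.02),
    UsageRecord("cloud-b", "gpu-node-3", 80.0, 0.60),
]

totals, idle = summarize(records)
print(totals)
print(idle)
```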

What Are Leading Organizations Doing?

Many organizations have successfully adopted hybrid and multi-cloud approaches for AI initiatives. Let’s examine two examples of organizations strategizing their hybrid and multi-cloud AI platforms.

Walmart Global Tech recently published a blog post sharing its experience deploying a machine learning platform across multiple clouds and regions. The post highlighted the challenges businesses face when scaling AI solutions, such as vendor lock-in, high license costs and fees, limited availability and reliability, and customization issues. Walmart emphasized that no single platform has all the answers, which led the company to adopt a multi-cloud strategy for its AI platform.

Another example is Uber, whose engineering team shared its multi-cloud practices at a recent Data Infra Meetup, where the team described the evolution of Uber’s data storage. Uber leverages two cloud vendors to build multi-cloud data lakes for AI, optimize ingress/egress costs, and manage storage costs effectively. The team also emphasized the importance of a unified layer for data orchestration and caching to ensure seamless integration and performance across multiple cloud environments.

Harnessing Hybrid Clouds

Adapting AI platforms to hybrid or multi-cloud environments presents both challenges and opportunities. Organizations can unlock the potential of multiple cloud providers by embracing containerization, deploying a data access layer, prioritizing security, and optimizing cost management. Ultimately, a well-executed hybrid or multi-cloud strategy lets organizations draw on the strengths of different cloud providers, fostering innovation, agility, and competitive advantage in the AI revolution.

Check out the following resources:

Download the trial edition of Alluxio Enterprise AI now: https://www.alluxio.io/download/
Watch the 3-minute product demo of solving the data loading challenge for machine learning with Alluxio: https://www.alluxio.io/resources/product-demo/solving-the-data-loading-challenge-for-machine-learning-with-alluxio/
See how the FinTech giant serving 1.3 billion users speeds up large-scale computer vision training on billions of small files: https://www.alluxio.io/blog/optimizing-alluxio-for-efficient-large-scale-training-on-billions-of-files/
Gain a comprehensive understanding of I/O patterns in each stage of the machine learning pipeline and the solutions that can be used in architecting your data and AI platform: https://www.alluxio.io/resources/whitepapers/efficient-data-access-strategies-for-large-scale-ai/
Join the latest events and the Slack community with 8000+ data & AI infra experts: https://linktr.ee/Alluxio
