Alluxio Blog

Top Tips and Tricks for PyTorch Model Training Performance Tuning [2023]

Get the latest and greatest tips to accelerate your PyTorch model training for machine learning and deep learning. PyTorch, an open-source machine learning framework, has become the de facto choice for many organizations to develop and deploy deep learning models. Model training is the most compute-intensive phase of the machine learning pipeline. It requires continuous … Continued

Trino Optimization With Distributed Caching on Data Lakes: Trino Fest 2023 Session Recap

Originally published on trino.io: https://trino.io/blog/2023/07/21/trino-fest-2023-alluxio-recap.html By 2025, there will be 100 zettabytes stored in the cloud. That’s 100,000,000,000,000,000,000,000 bytes – a huge, eye-popping number. But only about 10% of that data is actually used on a regular basis. At Uber, for example, only 1% of their disk space is used for 50% of the data they access … Continued

What’s New in Alluxio Enterprise 2.10: Radically Resource-efficient for Improved Speed at Lower Cost

We are pleased to unveil the latest version of Alluxio. This new release is a significant milestone in enhancing system reliability under resource limitations and stress scenarios, particularly in getting the most out of limited hardware resources to scale at manageable cost. Enhanced Functionality: Dramatic Improvements in High Availability (HA): Mission-critical applications … Continued

Building High-performance Data Access Layer for Model Training and Model Serving for LLM

Bringing a large language model from its initial training to deployment requires numerous systems and components. At Zhihu, we grappled with a multi-cloud, cross-region AI platform, requiring an efficient solution to facilitate the rapid training and delivery of models for production use cases. This led us to adopt Alluxio, the high-performance data access layer for … Continued

Millions Saved Annually: Unleashing the Power of Alluxio + HDFS at Uber

In October 2022, Uber’s Presto team described in a blog post how it uses the Alluxio SDK cache to boost Presto query performance and cost efficiency. This achievement is a major milestone in the collaboration between Alluxio and Uber. Thus far, the Uber Presto team has implemented the Alluxio SDK cache in three production clusters spanning over … Continued

Announcing Our First AI 🤖 PMC Member: CacheGPT

We are thrilled to announce that CacheGPT, a state-of-the-art natural language generation model, has joined the Alluxio Project Management Committee (PMC) as our newest member! CacheGPT has been an active contributor to Alluxio since the beginning of this year. It reviews pull requests and draft documentation using only emojis! See our new emoji-enriched documentation here! … Continued

Alipay: Optimizing Alluxio for Efficient Large-Scale Training on Billions of Files

Chuanying Chen, Senior Software Engineer at Ant Group, provides a deep dive into the practices of optimizing Alluxio for reliable, scalable, and high-performance large-scale training on billions of files. 1. Background Ant Group, formerly known as Ant Financial, is an affiliate company of the Chinese conglomerate Alibaba Group. The group owns the world’s largest mobile … Continued