Products
How to Develop and Operate Cloud Native Data Platforms and Applications
November 12, 2019
Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding systems were not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, on average we received over 1,000 alerts from thousands of machines on a daily basis, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC) emphasizing on version control and automation. MAC manages all the configurations using a GIT project in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.
In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to backup data to S3 and sync the metadata store of the ETL jobs running on HDFS and that of interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform and makes it easy to scale up and down and paves the way to a cloud native ETL processing stack.
Today, one can easily launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources could also easily become a nightmare if the corresponding systems were not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, on average we received over 1,000 alerts from thousands of machines on a daily basis, which stressed the operations team. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC) emphasizing on version control and automation. MAC manages all the configurations using a GIT project in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.
In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to backup data to S3 and sync the metadata store of the ETL jobs running on HDFS and that of interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform and makes it easy to scale up and down and paves the way to a cloud native ETL processing stack.
Videos:
Presentation Slides:
Complete the form below to access the full overview:
.png)
Videos
AI/ML Infra Meetup | AI at scale Architecting Scalable, Deployable and Resilient Infrastructure

Pratik Mishra delivered insights on architecting scalable, deployable, and resilient AI infrastructure at scale. His discussion on fault tolerance, checkpoint optimization, and the democratization of AI compute through AMD's open ecosystem resonated strongly with the challenges teams face in production ML deployments.
September 30, 2025
AI/ML Infra Meetup | Alluxio + S3 A Tiered Architecture for Latency-Critical, Semantically-Rich Workloads

In this talk, Bin Fan, VP of Technology at Alluxio, presents on building tiered architectures that bring sub-millisecond latency to S3-based workloads. The comparison showing Alluxio's 45x performance improvement over S3 Standard and 5x over S3 Express One Zone demonstrated the critical role the performance & caching layer plays in modern AI infrastructure.
September 30, 2025
AI/ML Infra Meetup | Achieving Double-Digit Millisecond Offline Feature Stores with Alluxio

In this talk, Greg Lindstrom shared how Blackout Power Trading achieved double-digit millisecond offline feature store performance using Alluxio, a game-changer for real-time power trading where every millisecond counts. The 60x latency reduction for inference queries was particularly impressive.
September 30, 2025