Today, one can launch or terminate services with hundreds or thousands of compute instances in just a few seconds on cloud services such as AWS. However, operating, monitoring and maintaining those resources can just as easily become a nightmare if the corresponding systems are not designed in a cloud-native way.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA). In the first generation of the monitoring system, configurations were manually created for many individual software components and spread over all the resources. As services were started and terminated rapidly over time, it was extremely difficult to keep all configurations up to date. Consequently, we received on average over 1,000 alerts per day from thousands of machines, which put the operations team under constant stress. We redesigned the system in late 2018 in a project called Monitoring As Code (MAC), which emphasizes version control and automation. MAC manages all the configurations in a Git project in the same way as software code. Moreover, it establishes standards so that the configurations are automatically generated and deployed to keep everything in sync. As a result, it reduced the daily average number of alerts by two orders of magnitude.
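To make the idea of generating configurations from a standard concrete, here is a minimal Python sketch. It is not MAC's actual implementation: the check names, thresholds, file layout, and the functions `generate_configs` and `stale_paths` are all hypothetical, chosen only to illustrate how a service inventory plus a shared standard can yield deterministic files suitable for committing to a Git project.

```python
import json

# Hypothetical standard: every service gets the same baseline checks.
# The check names and thresholds are illustrative assumptions.
STANDARD_CHECKS = {
    "cpu_utilization": {"threshold": 90, "for_minutes": 10},
    "disk_usage": {"threshold": 85, "for_minutes": 30},
}


def generate_configs(services, checks=STANDARD_CHECKS):
    """Return {relative_path: file_content} for every service in the inventory.

    Output is deterministic (sorted services, sorted keys) so that a
    version-control diff only shows real changes.
    """
    configs = {}
    for svc in sorted(services, key=lambda s: s["name"]):
        rules = []
        for check_name in sorted(checks):
            rule = dict(checks[check_name])
            # Per-service overrides refine the standard instead of replacing it.
            rule.update(svc.get("overrides", {}).get(check_name, {}))
            rules.append({"check": check_name, "service": svc["name"], **rule})
        path = f"alerts/{svc['name']}.json"
        configs[path] = json.dumps(rules, indent=2, sort_keys=True)
    return configs


def stale_paths(existing_paths, configs):
    """Paths on disk that no longer correspond to a running service."""
    return sorted(set(existing_paths) - set(configs))
```

In a setup like this, a terminated service simply disappears from the inventory, so its configuration shows up in `stale_paths` and can be removed in the same commit that regenerates the rest, rather than lingering and producing alerts.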
In the first generation of the data platform, we used HDFS as a cache layer between ETL jobs and the underlying AWS storage service S3. However, HDFS is not a special-purpose cache service, so custom code is needed to make it work like a cache. We have to run a backup workflow in every ETL job to back up data to S3 and to sync the metadata store of the ETL jobs running on HDFS with that of the interactive analytic queries running directly on S3. Moreover, we rely on complex and fragile mechanisms for purging datasets when the clusters are under heavy load. The use of HDFS also makes it a challenge to rapidly scale up the YARN cluster during peak hours and scale it down during off-hours. We are currently redesigning the data platform, mainly by replacing HDFS with a special-purpose data orchestration service called Alluxio. In our initial evaluation, Alluxio not only provides better performance than HDFS but also significantly simplifies the architecture of our data platform. It makes the platform easy to scale up and down, and paves the way to a cloud-native ETL processing stack.
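The difference between a general-purpose file system pressed into service as a cache and a special-purpose cache can be illustrated with a toy model. The sketch below is an in-memory illustration only: it does not use or mimic Alluxio's API, the class `WriteThroughCache` is hypothetical, and a plain dictionary stands in for S3. It shows the two behaviors the paragraph above has to build by hand on HDFS: persisting writes to the backing store without a separate backup step, and evicting data automatically under capacity pressure.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Toy write-through, read-through cache with LRU eviction.

    `backing_store` is any dict-like object standing in for durable
    storage such as S3 (an assumption made for illustration).
    """

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store
        self.capacity = capacity
        self._cache = OrderedDict()  # least recently used entry first

    def write(self, key, value):
        # Persist immediately, so no separate backup workflow is needed.
        self.backing_store[key] = value
        self._put(key, value)

    def read(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        # Cache miss: fall through to durable storage, then cache the result.
        value = self.backing_store[key]
        self._put(key, value)
        return value

    def _put(self, key, value):
        self._cache[key] = value
        self._cache.move_to_end(key)
        # Evict automatically instead of relying on a manual purge job.
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)

    def cached_keys(self):
        return list(self._cache)
```

Because the backing store always holds the complete data in this model, cache nodes carry no unique state, which is the property that makes adding or removing capacity straightforward.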
Du Li is currently an Architect of Data Infrastructure at Electronic Arts. He worked in academia and industrial labs for many years after earning his PhD degree from UCLA. Prior to joining EA in mid-2018, he was a software engineer at Yahoo and Apple.