Building a high-performance platform on AWS to support real-time gaming services using Presto and Alluxio

Overview

Electronic Arts (EA) is a leading company in the gaming industry, providing over a thousand games to serve billions of users worldwide. The EA Data & AI Department builds hundreds of platforms to manage the petabytes of data generated by games and users every day. These platforms cover a wide range of data analytics workloads, from real-time data ingestion to ETL pipelines. The formatted data our department produces is widely used by executives, producers, product managers, game engineers, and designers for marketing and monetization, game design, customer engagement, player retention, and end-user experience.

Near real-time information about EA’s online services is critical for business activities such as running campaigns and troubleshooting. The services that depend on it include, but are not limited to, real-time data visualization, dashboarding, and conversational analytics. Our team is actively seeking a framework that can support these use cases.

At EA, we have adopted numerous data visualization tools, such as Tableau and Dundas, to support data insight analytics. These tools are usually connected to multiple data sources, such as MySQL, AWS S3, or HDFS. Users can load data from multiple endpoints simultaneously to run computationally heavy algorithms. Loading the data is a severe performance bottleneck because it is I/O heavy, and the problem gets worse when the same data has to be loaded multiple times. We therefore need a solution that reduces this data retrieval overhead by caching the data locally.

Dashboarding is another common use case, used to track user engagement, customer satisfaction, or system status in real time. In these cases, the data volume is usually on the order of gigabytes, but frequent refreshes require near-instantaneous processing. Currently, we use commercial databases such as Redshift to serve time-sensitive data, and we are seeking an alternative that cuts costs without losing performance.

We recently developed a reporting chatbot to provide immediate game-related insights, such as live user satisfaction and real-time profit analysis. The backend of this system runs Presto with petabytes of data stored on S3. The chatbot converts users’ questions into ANSI SQL and runs the resulting queries on the Presto cluster. The queries usually perform complex computations, such as prediction and merging, after searching across datasets. We are eager to find a solution that complements our S3-based datasets and improves performance without introducing extra cost.

Architecture

We evaluated Alluxio as a data orchestration layer between our storage and data processing platforms. Alluxio is recognized as a high-performance data orchestration system and has been widely adopted in data processing systems. In our evaluation, we compared a mock of the aforementioned production setup of Presto on S3 against a similar stack with Alluxio. The architecture of the test setup was as follows:

  • Each instance launched Presto and Alluxio, co-locating the two services. 
  • For hardware, we used three h1.8xlarge AWS instances, each with 8TB ephemeral disks mounted for use by Alluxio to cache data local to Presto.
  • S3 was mounted to Alluxio as the underlying persistent file system.
  • Two catalogs were configured for Presto: one connected to our existing Hive metastore, referencing the benchmark datasets stored externally on S3, and the other connected to a separate Hive metastore with the benchmark tables created in Alluxio.
  • We used the same datasets on S3 for the performance comparison and pre-loaded the data into Alluxio by running: alluxio fs distributedLoad /testDB
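Because S3 is mounted as the root of the Alluxio namespace, each object under the mounted S3 prefix has a corresponding Alluxio path, and that path is what the tables in the second metastore point at. The sketch below illustrates the mapping. The bucket, prefix, and master hostname are hypothetical placeholders, and 19998 is assumed because it is Alluxio's default master RPC port.

```python
# Hypothetical values for illustration only; substitute your own.
UFS_ROOT = "s3://example-bucket/warehouse"   # the S3 prefix mounted as the Alluxio root
ALLUXIO_MASTER = "alluxio-master:19998"      # 19998 is Alluxio's default master RPC port


def to_alluxio_uri(s3_uri: str, ufs_root: str = UFS_ROOT,
                   master: str = ALLUXIO_MASTER) -> str:
    """Map an object under the root UFS mount to its path in the Alluxio namespace."""
    root = ufs_root.rstrip("/")
    if s3_uri != root and not s3_uri.startswith(root + "/"):
        raise ValueError(f"{s3_uri} is not under the mounted root {root}")
    relative = s3_uri[len(root):]  # keeps the leading "/", or "" for the root itself
    return f"alluxio://{master}{relative or '/'}"


print(to_alluxio_uri("s3://example-bucket/warehouse/testDB/events/part-0000.orc"))
```

With this mapping, the same files can be registered once with an s3:// location and once with an alluxio:// location, which is what allows the two Presto catalogs to be compared on identical data.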

The Alluxio cluster was started with the following configuration settings:

# Impersonation 
alluxio.master.security.impersonation.presto.users=*
alluxio.master.mount.table.root.ufs=<s3://alluxio_path>
alluxio.security.authorization.permission.enabled=false
alluxio.security.authentication.type=SIMPLE
alluxio.security.authorization.permission.supergroup=*

# Alluxio Worker tier configuration
alluxio.worker.block.heartbeat.interval=30sec

# Prevent disk thrashing
alluxio.user.file.passive.cache.enabled=false

# Increase Threadpool concurrency for Presto
alluxio.user.block.master.client.pool.size.max=256
alluxio.user.file.master.client.pool.size.max=256

# Return full list of blocks
alluxio.user.ufs.block.location.all.fallback.enabled=true 

# Worker properties
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level0.dirs.path=</path1>
alluxio.worker.tieredstore.level0.dirs.quota=<1000GB> 

# file replica
alluxio.user.file.replication.max=3

# User properties
alluxio.user.file.readtype.default=CACHE_PROMOTE
# Writes data only to Alluxio before returning a successful write
alluxio.user.file.writetype.default=MUST_CACHE

The settings above are our initial alluxio-site.properties.

We noticed that Alluxio did not perform as expected when processing large numbers of small files, so we enabled metadata caching to tune the performance:

alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.max.size=100000
alluxio.user.metadata.cache.expiration.time=10min 

The settings above are the additions to alluxio-site.properties that enable metadata caching.
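To see why a client-side metadata cache helps with many small files, consider a toy model in which every file access first asks the master for the file's metadata. The sketch below is not Alluxio's implementation. It is a minimal size- and time-bounded cache whose max_size and ttl_seconds parameters mirror the cache.max.size and cache.expiration.time settings above, and the least-recently-used eviction policy is an assumption made for illustration.

```python
import time
from collections import OrderedDict


class MetadataCache:
    """Toy size- and time-bounded cache for file metadata lookups."""

    def __init__(self, fetch, max_size=100000, ttl_seconds=600.0, clock=time.monotonic):
        self._fetch = fetch            # callable that asks the master for metadata
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # path -> (expiry, metadata), oldest first
        self.master_calls = 0

    def get_status(self, path):
        now = self._clock()
        hit = self._entries.get(path)
        if hit is not None and hit[0] > now:
            self._entries.move_to_end(path)    # mark as recently used
            return hit[1]
        self.master_calls += 1                 # miss or expired: go to the master
        meta = self._fetch(path)
        self._entries[path] = (now + self._ttl, meta)
        self._entries.move_to_end(path)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return meta


# 5000 small files, each touched by three successive dashboard refreshes.
cache = MetadataCache(fetch=lambda p: {"path": p, "length": 2 * 1024 * 1024})
paths = [f"/testDB/small/part-{i:05d}" for i in range(5000)]
for _ in range(3):
    for p in paths:
        cache.get_status(p)
print(cache.master_calls)
```

In this model only the first refresh reaches the master. Later refreshes within the expiration window are served from the cache, which is the effect that matters most when per-file metadata cost dominates per-file data cost.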

Benchmark Results

We used four independent benchmarks to measure performance with and without Alluxio:

Test 1 runs our internal benchmark, which consists of synthetic snapshots of player in-game events. The datasets are in ORC format with total sizes of 1GB, 10GB, and 100GB. Each dataset is created with the same DDL and contains 49 columns: 40 varchar, 5 boolean, and 4 map. The benchmark queries select all columns with a filter condition on one varchar field, which is typical of I/O-heavy use cases.

Result: Alluxio with metadata caching is 2x to 7x faster than S3.
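For readers who want to build a table with the same shape as the Test 1 datasets, the following sketch generates a Presto DDL statement with 40 varchar, 5 boolean, and 4 map columns, plus a matching query. The column names, table name, location, and the map(varchar, varchar) key and value types are all hypothetical, since the actual DDL is internal. The format and external_location table properties are those of the Presto Hive connector.

```python
def build_ddl(table: str, location: str) -> str:
    """Render a CREATE TABLE with 40 varchar, 5 boolean and 4 map columns (49 total)."""
    columns = (
        [f"str_{i:02d} varchar" for i in range(40)]
        + [f"flag_{i} boolean" for i in range(5)]
        + [f"attrs_{i} map(varchar, varchar)" for i in range(4)]  # assumed map types
    )
    body = ",\n  ".join(columns)
    return (
        f"CREATE TABLE {table} (\n  {body}\n)\n"
        f"WITH (format = 'ORC', external_location = '{location}')"
    )


def build_query(table: str, column: str = "str_00", value: str = "example") -> str:
    """Select every column, filtered on a single varchar field."""
    escaped = value.replace("'", "''")  # double single quotes for a SQL string literal
    return f"SELECT * FROM {table} WHERE {column} = '{escaped}'"


# Hypothetical catalog, schema, table, and location.
print(build_ddl("hive_alluxio.testdb.events_1gb",
                "alluxio://alluxio-master:19998/testDB/events_1gb"))
print(build_query("hive_alluxio.testdb.events_1gb"))
```

Selecting every column of a wide table forces the engine to read all column streams of each ORC file, so query time is dominated by how quickly bytes arrive from storage.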

Test 2 simulates data visualization with game metadata and user engagement records. We selected two commonly used datasets, one queried from Tableau and one from Dundas, together with the queries frequently run against them. The queries select all columns with a date filter condition, followed by GROUP BY and ORDER BY on the date. This is a typical query that stresses both CPU and I/O. We left metadata caching disabled in this test because Alluxio already showed a significant improvement without it.

Result: Without metadata caching, Presto with Alluxio is 2.75x faster than S3 with the Dundas dataset and 5.1x faster with the Tableau dataset.
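Speedup figures like these are ratios of query time on the S3 catalog to query time on the Alluxio catalog. This post does not describe how the timings were collected, so the harness below is only one plausible approach: it takes the median of several runs per catalog and divides. The catalog names hive_s3 and hive_alluxio are hypothetical, and run_query stands in for whatever Presto client call submits a query to a given catalog.

```python
import statistics
import time
from typing import Callable, Dict


def time_query(run_query: Callable[[str, str], object], catalog: str, sql: str,
               runs: int = 3, clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the median wall-clock seconds over `runs` executions on one catalog."""
    samples = []
    for _ in range(runs):
        start = clock()
        run_query(catalog, sql)
        samples.append(clock() - start)
    return statistics.median(samples)


def compare(run_query, sql: str, baseline: str = "hive_s3",
            candidate: str = "hive_alluxio", runs: int = 3,
            clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time the same SQL on both catalogs and report baseline / candidate as speedup."""
    base = time_query(run_query, baseline, sql, runs, clock)
    cand = time_query(run_query, candidate, sql, runs, clock)
    return {"baseline_s": base, "candidate_s": cand, "speedup": base / cand}


# Stand-in runner and clock so the sketch runs without a cluster:
# the fake S3 catalog "takes" 8 seconds per query, the fake Alluxio catalog 2.
fake_now = [0.0]


def fake_run(catalog: str, sql: str) -> None:
    fake_now[0] += 8.0 if catalog == "hive_s3" else 2.0


result = compare(fake_run, "SELECT 1", clock=lambda: fake_now[0])
print(result)
```

Using the median rather than a single run reduces the influence of one-off effects such as a cold cache on the first execution, although whether to include or exclude cold runs depends on what the benchmark is meant to show.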

Test 3 simulates our dashboarding use case using datasets with a large number of small files. The datasets are batches of 2MB files, containing 50, 500, and 5000 files respectively. The query is a SELECT that counts the number of entries for each date.

Result: Alluxio with metadata caching is 1.2x to 5.9x faster than S3. Without metadata caching, the gain is only 1x (no improvement) to 1.35x. Enabling metadata caching significantly reduces the execution time by caching metadata, recognizing hot data, and increasing replicas.
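It helps to put the Test 3 batches in terms of total volume. Assuming every file is exactly 2MB (the post gives only the nominal size), the three batches come to roughly 100MB, 1GB, and 10GB, so the file count grows much faster than the amount of data a dashboard refresh has to read:

```python
FILE_SIZE_MB = 2  # nominal file size from the test description


def batch_size_mb(file_count: int, file_size_mb: int = FILE_SIZE_MB) -> int:
    """Total size of a batch of equally sized files, in MB."""
    return file_count * file_size_mb


for file_count in (50, 500, 5000):
    total_mb = batch_size_mb(file_count)
    print(f"{file_count:>5} files x {FILE_SIZE_MB} MB = {total_mb:>6} MB")
```

At these sizes, per-file overhead such as metadata lookups can make up a large share of query time, which is consistent with metadata caching making the biggest difference in this test.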

Test 4 simulates the conversational bot. The dataset used was a snapshot of daily game performance. The query contains multiple stages of calculation to simulate a CPU intensive query. It converts an integer field into HyperLogLog, merges it, and selects the cardinality. The results are filtered by an integer and varchar field.

Result: Alluxio without metadata caching reduces the execution time from 85.2 seconds to 3 seconds, improving performance by as much as 27x.
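The shape of the Test 4 query can be sketched with Presto's HyperLogLog functions: approx_set builds a sketch from a column, merge combines sketches, and cardinality returns the estimated distinct count. The actual query is not shown in this post, so the table name, column names, filter values, and grouping below are hypothetical and only illustrate the pattern.

```python
def build_hll_query(table: str = "daily_game_performance",
                    id_column: str = "player_id",
                    int_filter: int = 3,
                    varchar_filter: str = "example-title") -> str:
    """Build a Presto query that sketches, merges, and counts distinct values."""
    escaped = varchar_filter.replace("'", "''")  # escape quotes in the literal
    return (
        "SELECT cardinality(merge(hll)) AS approx_distinct_ids\n"
        "FROM (\n"
        f"    SELECT approx_set({id_column}) AS hll\n"
        f"    FROM {table}\n"
        f"    WHERE platform_id = {int(int_filter)}\n"
        f"      AND game_title = '{escaped}'\n"
        "    GROUP BY event_date\n"
        ")"
    )


print(build_hll_query())
```

The inner query produces one HyperLogLog sketch per day and the outer query merges them, which gives the multi-stage, CPU-intensive plan that the test description refers to.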

Conclusion

This blog post explores a platform that uses Presto as the computing engine and Alluxio as a data orchestration layer between Presto and S3 storage, to support online services in the gaming industry that need near-instantaneous responses. We evaluated this platform with real industrial examples of data visualization, dashboarding, and a conversational chatbot. Our preliminary results show that Presto with Alluxio significantly outperforms Presto reading directly from S3 in all cases. In particular, Alluxio with metadata caching shows up to a 5.9x performance gain when handling large numbers of small files. Alluxio enables the separation of storage and compute by managing the allocated ephemeral disks to cache data from S3 local to Presto. Advanced cache management with an asymmetric number of replicas for hot vs. cold data accounts for the performance gains in each scenario we tested.