Which EC2 instance types are recommended for running Alluxio with applications such as Presto or Spark? Do EBS volumes with provisioned IOPS make a big difference?

Presto and Spark are CPU-bound, so they need CPU-intensive instances. They also need plenty of memory, however, which is why most users end up running their Presto/Spark workloads on memory-optimized R4/R5 instances. The memory on each instance is shared between Presto/Spark and Alluxio; typically we see about 60% going to compute, 30% to Alluxio, and the rest to the OS.
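To make the 60/30/rest split concrete, here is a rough sketch of the arithmetic. The function name `split_memory` is made up for illustration, and the 128 GiB figure is an assumed example (the memory of an r5.4xlarge); the fractions are just the typical values quoted above, not a fixed rule.

```python
def split_memory(total_gib, compute_frac=0.60, alluxio_frac=0.30):
    """Split an instance's memory between compute, Alluxio, and the OS.

    The default fractions are the typical values from the answer above;
    treat them as a starting point, not a recommendation for every workload.
    """
    if compute_frac < 0 or alluxio_frac < 0 or compute_frac + alluxio_frac > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    compute = total_gib * compute_frac
    alluxio = total_gib * alluxio_frac
    # Whatever is left over stays with the operating system.
    os_rest = total_gib - compute - alluxio
    return {"compute_gib": compute, "alluxio_gib": alluxio, "os_gib": os_rest}


# Assumed example: one instance with 128 GiB of memory.
for name, gib in split_memory(128).items():
    print(f"{name}: {gib:.1f}")
```

For a 128 GiB instance this works out to roughly 76.8 GiB for Presto/Spark, 38.4 GiB for Alluxio, and 12.8 GiB left for the OS.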

EBS volumes with provisioned IOPS can make a large difference. Alluxio can use memory and disk as storage tiers and automatically manages the migration of data between them. If the data is frequently requested and larger than the available memory, it can make sense to use EBS volumes to keep the data closer to compute rather than fetching it from the underlying storage system.
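As a back-of-the-envelope check on whether a disk tier is worth adding, the sketch below estimates how much frequently requested data would spill past the memory tier. The function `plan_tiers` and all the numbers are hypothetical; it only does capacity arithmetic and says nothing about how Alluxio actually places or evicts data.

```python
def plan_tiers(hot_data_gib, workers, alluxio_mem_per_worker_gib):
    """Estimate how much hot data fits in memory and how much needs disk.

    This is capacity arithmetic only; it assumes data is spread evenly
    across workers, which real workloads may not satisfy.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    mem_capacity = workers * alluxio_mem_per_worker_gib
    in_memory = min(hot_data_gib, mem_capacity)
    # Anything beyond the memory tier would have to live on the disk tier
    # (for example EBS) or be re-fetched from the underlying storage.
    on_disk = max(0.0, hot_data_gib - mem_capacity)
    return {
        "memory_tier_gib": in_memory,
        "disk_tier_gib": on_disk,
        "disk_per_worker_gib": on_disk / workers,
    }


# Assumed example: 1000 GiB of hot data, 10 workers, 38.4 GiB for Alluxio each.
print(plan_tiers(1000, 10, 38.4))
```

If `disk_tier_gib` comes out as zero, the hot data fits in memory and a disk tier adds little; otherwise `disk_per_worker_gib` gives a lower bound on the EBS capacity to attach to each worker.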

Tags: presto, spark

Alluxio and Compute Answers