When working with big data analytics and AI, you are likely reading and writing terabytes to and from S3, in some cases at very high transfer rates. If S3 seems to be slowing down your data-intensive job, there are several possible causes worth checking.
- Region and connectivity – Make sure your S3 bucket and your EC2 instances are in the same region; cross-region requests add network latency (and data transfer cost). Performance can also vary between regions, for example Oregon can be faster than Virginia in some cases.
- If your data is remote, or the region where your computation runs doesn’t match the region of the S3 bucket, you may want to cache the data local to the compute to achieve stronger data locality. Alluxio can help cache data for frameworks like Spark, Presto, Hive, TensorFlow, and more.
- Instance type – Make sure the EC2 instance type you picked matches your requirements. Different AWS instance types offer different network bandwidth; check the “Network Performance” column on ec2instances.info.
- Extent of concurrency of the workload – Each S3 operation is an API request with significant latency (tens to hundreds of milliseconds), which adds up quickly if you have millions of objects and work with them one at a time. Your overall throughput is therefore determined by how many worker threads (connections) run on each instance and how many instances are used. Make sure your job has enough concurrency to meet your performance requirements.
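As a quick check of the first point, the bucket’s region can be compared against the region where your compute runs. One wrinkle worth coding around: S3’s GetBucketLocation reports us-east-1 as an empty location. A minimal sketch (the helper name `normalize_region` is ours, and the boto3 call is shown in a comment as an assumption about your setup):

```python
def normalize_region(location_constraint):
    # S3's GetBucketLocation reports us-east-1 as None/empty (legacy
    # behavior), so normalize before comparing with the compute region.
    return location_constraint or "us-east-1"

# With boto3 installed, the lookup itself would look like:
#   resp = boto3.client("s3").get_bucket_location(Bucket="my-bucket")
#   region = normalize_region(resp.get("LocationConstraint"))
```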
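The concurrency point can be sketched with a thread pool, with the actual S3 call abstracted out (`fetch_one` stands in for whatever performs a single request, e.g. a wrapper around a boto3 `get_object` call in your client):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(keys, fetch_one, max_workers=32):
    # Issue many requests in parallel so per-request latency overlaps
    # instead of accumulating; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

As a rough latency-bound estimate, at ~50 ms per request a single thread manages ~20 requests per second, while 32 workers can push a few hundred per second per instance.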
In some cases, even with sufficient concurrency, you may see throttling (503 Slow Down responses) from S3 when the request rate exceeds what it can serve. Your application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.
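Because those limits apply per prefix, one way to raise the ceiling is to spread keys across multiple prefixes, for example with a short hash shard. The layout below is an illustration of the idea, not an S3 requirement:

```python
import hashlib

def sharded_key(key, shards=16):
    # Derive a stable shard from the key so the same key always maps to
    # the same prefix; 16 prefixes give up to 16x the per-prefix limits.
    shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % shards
    return f"{shard:02x}/{key}"
```

Note the trade-off: listing all objects now requires enumerating every shard prefix.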
If these tips still don’t meet your performance requirements, you may want to consider a caching layer like Alluxio, which caches not only data but also S3 metadata, making operations like ‘List’ and ‘Rename’ significantly faster. In addition, it can write to S3 asynchronously, further improving the performance of write-heavy applications. See our documentation for more details and how to get started.
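As an illustration, with an Alluxio cluster already deployed, an S3 path can be mounted into the Alluxio namespace and jobs pointed at the `alluxio://` path instead of `s3://`. Bucket name, mount point, and host below are placeholders; check the Alluxio documentation for the exact options of your version:

```shell
# Mount an S3 path into the Alluxio namespace (names are placeholders)
./bin/alluxio fs mount /s3data s3://my-bucket/data

# Jobs then read through the cache, e.g. in Spark:
#   spark.read.parquet("alluxio://<master-host>:19998/s3data/table")
```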