Why am I seeing AWS S3 500 / 501 errors?

Increasingly S3 is being used as a data store for analytical and machine learning workloads. This means that it is very easy to generate a massive amount of get operations and request data from S3. For example: a couple of commands can launch a 1000 node cluster of AWS EMR service with the Spark or … Continued

Tags: ,

How do you run Spark on On-Premise S3?

Introducing S3 and Spark S3 has become the de-facto standard API for digital business applications to store unstructured data chunks. To this end, several vendors have S3-API compatible offerings that allow app developers to standardize on the S3 API’s on-premise, and port these apps to run on other platforms when ready. So, what is S3 and … Continued

Tags: , , ,

Top 5 Performance Tuning Tips for Running Presto on Alluxio

Presto is an open source distributed SQL engine widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source distributed file system that provides a unified data access layer at in-memory speed. The combination of Presto and Alluxio is getting more popular in many companies like JD, NetEase to leverage Alluxio … Continued

Alluxio Developer Tip: Why am I seeing the error “User yarn is not configured for any impersonation. impersonationUser: foo?”

What is User Impersonation? Impersonation is simply the ability for one user to act on behalf of another user. For example, say user ‘yarn’ has the credentials to connect to a service, but user ‘foo’ does not. Therefore, user ‘foo’ would never be able to access the service. However, user ‘yarn’ can access the service … Continued

Top 10 Tips for Making the Spark + Alluxio Stack Blazing Fast

The Apache Spark + Alluxio stack is getting quite popular particularly for the unification of data access across S3 and HDFS. In addition, compute and storage are increasingly being separated causing larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But to get the best performance, like any technology stack, you need to follow the … Continued

Developer Tip: Why Did My Job Fail with Error Message “Class alluxio.hadoop.FileSystem not found”?

From time to time, a question pops up on the user mailing list referencing job failures with the error message “java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found“. This post explains the reason for the failure and the solution to the issue when it occurs. Why does this happen? This error indicates the Alluxio client is not available at runtime. … Continued