How do you reduce large performance variance in HDFS namenode?

Some people experience serious performance issue in HDFS namenode (v2.7) response time. Particularly during peak traffic time, an HDFS namenode can become overloaded and some DFS operations (like listing a directory) can take a long time, which affects the query response time for Presto and other Hadoop applications.

To solve for challenges in high latency Namenode RPC latency during peak times, you can use a multi-layer architecture. For example, if you have a big, highly utilized Hadoop cluster (thousands of nodes) with smaller computing clusters (about 1 thousand nodes) around it, you can run Presto and other different frameworks on Alluxio. Alluxio will serve as a caching layer to the big HDFS cluster. In this way, the data and metadata service pressure will be shielded by the Alluxio deployment.

For more details, you can look at this Strata Presentation from JD.com that shows how they use Alluxio as a fault tolerant pluggable optimization component to compute frameworks.