How do I troubleshoot/fix slow running jobs in Hadoop?
So you have a Hadoop cluster that’s running fine and then you start to hear people saying that their jobs are running slow. This happens all the time and is a common part of the care and feeding of your Hadoop cluster. As your cluster is a shared resource to run a wide variety of jobs, this can happen for a variety of reasons. This answer is meant to cover common reasons for slowness and look at some solutions to this problem.
It’s fairly easy to check if you have a hardware problem. As you have a distributed system, you’ll have to check where the slowness is coming from. Is it the name node, is it a data node, or it is a NIC card? You can run a thread dump to see what’s happening to the slow job. Also you can check the memory and disk utilization. And finally you can look at the garbage collection (GC) pauses.
But is your cluster simply out of capacity?
As organizations become more data-driven, the number and scope of data jobs is rapidly increasing. More data analysts and scientists typically means more data jobs. Add in a new application framework like Presto or TensorFlow and that can add a lot more jobs. Some of the growth in compute jobs can be planned and the cluster can be sized ahead of time. But many times, data analysts and scientists cannot easily predict what they need, nor will they know how much compute their workloads translate into. Oftentimes clusters have a monthly pattern that looks something like this:
Source: AWS EMR Migration Guide
As you can see above depending on the jobs that are running concurrently, the cluster can become over-utilized:
One easy way to determine this is to look on your metrics for CPU-load. If you’re running 90% or above, you’re likely running more jobs than your cluster can process. This means your cluster is compute-bound and oftentimes, jobs will run slower because they will need to wait in long job queues or because the cluster itself is saturated, causing slowness in jobs.
Options when you’re out of capacity:
1. Tell people to wait for their slow jobs or run their jobs at a time when the cluster is under-utilized. This is the “do nothing” option.
2. See if there are unusual workloads and change those jobs. While it’s easy to see which jobs are taking up the most compute capacity, usually you cannot tell your internal customers to rewrite or change their code. This is the “I’m going to help you even though you don’t want my help” option.
3. Buy more servers to increase overall compute capacity. Of course, you need to have the budget to cover the capital costs. Plus you’ll need the time that is required to rack and stack the new gear. This may require more networking, more racks, and more datacenter space. This is the “I’m made of money” option.
4. Burst the transient workloads to the cloud. Cloud computing is perfect for handling transient workloads. For one, you only pay for what you use and for two, you can usually use the spare CPU cycles for your compute at a major discount. For example, AWS Spot Instance can be 80% lower cost than other non-pre-emptible instances. This is the “Let’s see if we can leverage the cloud” option. There are 3 common ways of doing bursting:
4a) Lift and Shift all or part of my HDFS cluster into the cloud. You would look at the jobs and create a new HDFS cluster using Public Cloud instances. Usually you would use an object store like S3 to be the backing store. This requires an understanding of what data is needed to create a new silo of compute and storage. This requires some planning and synchronizing of the data copies. It also results in additional costs of the cloud and HDFS.
4b) Copy a subset of your data for the specific workload into a cloud object store and run analytics jobs on that. This requires some planning and synchronizing of the data copies.
4c) Compute-driven Data Caching to cloud with data on-prem. In this case, the data that is needed is determined by the compute frameworks themselves. So there’s nothing to copy and keep in sync. This is ideal for an immediate and scalable solution to augment your on-prem compute capacity. One of the common uses for Alluxio is for this. Try it for free today: See the Alluxio Zero-Copy Bursting Solution.
Here is a summary of bursting solutions: