Introduction to AMAZON EMR and MapReduce
Amazon Elastic MapReduce (EMR) is a tool for processing and analyzing big data quickly. Using query tools like Spark, Hive, HBase, and Presto along with storage (like S3) and compute capacity (like EC2), you can use EMR to run large-scale analysis that’s cheaper than a traditional on-premise cluster.
EMR is based on Apache Hadoop. MapReduce allows developers to process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The ‘elastic’ in EMR means it has a dynamic and on-demand resizing capability, allowing it scale resources up and down quickly depending on the demand.
EMR is used for many different types of big data use cases, like machine learning, ETL, financial and scientific simulation, bioinformatics, log analysis, and deep learning.
Specific use cases may be performing large data transformations and deliver data to internal and external customers, delivering real-time insights on massive amounts of data on-demand, leveraging the power of the Apache Hadoop framework to perform data transformations, and faster/richer analytics and data warehousing capabilities while reducing costs using EMR for ETL.
Google invented MapReduce in 2003, which back then was a new style of data processing to manage large-scale processing across large clusters of commodity servers. They needed to solve for the massive amount of content on the web that required indexing. A year after Google published the whitepaper that described the MapReduce framework in 2004, Apache Hadoop was created. Hadoop is used to help manage big data – the variety, volume, and velocity of structured and unstructured data.
Amazon EMR significantly reduces the complexity of the time-consuming set-up, management. and tuning of Hadoop clusters or the compute capacity upon which they sit. You can instantly spin up large Hadoop clusters which will start processing within minutes, not hours or days.
Developers can quickly perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. They can also develop and run more sophisticated applications with functionality like scheduling, workflows, monitoring, or other features. To get started you develop your data processing application, upload it and data to S3, and configure and launch a cluster.
Amazon EMR enables you to scale the number of nodes in your cluster up and down by specifying the number and types of instances in the cluster. This contributes to the speed of data processing completion. When there is little or no workload, you can resize your cluster and scale down. Alternatively, when the job is too slow, you can scale your cluster up to add processing power.
You can manually resize a running cluster using the AWS Management Console, AWS CLI, or the Amazon EMR API.
• AWS Management Console: Resize a cluster with new instance count or if your cluster uses instance fleets, input new values for on-demand units and spot units.
• AWS CLI: Increase or decrease the number of task nodes, and increase the number of core nodes.
• Amazon EMR API: Add 1-48 task instance groups to a cluster.
While Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data, EMR implements mentioned framework and provides a managed Hadoop platform to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run frameworks such as Spark and Presto in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
The map function transforms the piece of data into key-value pairs and then the keys are sorted where a reduce function is applied to merge the values based on the key into a single output.
Slave nodes are where Hadoop data is stored and where data processing takes place. In a Hadoop cluster, each data node (also known as a slave node) runs a background process named DataNode. This background process (also known as a daemon) keeps track of the slices of data that the system stores on its computer. It regularly talks to the master server for HDFS (known as the NameNode) to report on the health and status of the locally stored data.
That happens with the following services:
NodeManager: Coordinates the resources for an individual slave node and reports back to the Resource Manager.
ApplicationMaster: Tracks the progress of all the tasks running on the Hadoop cluster for a specific application.
Container: A collection of all the resources needed to run individual tasks for an application. When an application is running on the cluster, the Resource Manager schedules the tasks for the application to run as container services on the cluster’s slave nodes.
TaskTracker: Manages the individual map and reduce tasks executing on a slave node for Hadoop 1 clusters. In Hadoop 2, this service is obsolete and has been replaced by YARN services.
DataNode: An HDFS service that enables the NameNode to store blocks on the slave node.
RegionServer: Stores data for the HBase system. In Hadoop 2, HBase uses Hoya, which enables RegionServer instances to be run in containers.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. You commonly see slave nodes now where each node typically has between 12 and 16 locally attached 3TB hard disk drives. Slave nodes use moderately fast dual-socket CPUs with six to eight cores each — no speed demons, in other words. This is accompanied by 48GB of RAM. In short, this server is optimized for dense storage.
The data processed by MapReduce should be stored in HDFS, which divides the data into blocks and stores distributedly. This is a MapReduce workflow:
1. One block is processed by one mapper at a time. In the mapper, a developer can specify business logic. In this manner, Map runs on all the nodes of the cluster and process the data blocks in parallel.
2. Output of Mapper also known as intermediate output is written to the local disk. An output of mapper is not stored on HDFS as this is temporary data and writing on HDFS will create unnecessary many copies.
3. Output of mapper is shuffled to reducer node (which is a normal slave node but reduce phase will run here). The shuffling/copying is a physical movement of data which is done over the network.
4. Once all the mappers are finished and their output is shuffled on reducer nodes the intermediate output is merged & sorted, which is provided as input to reduce phase.
5. Reduce is the second phase of processing where the user can specify custom business logic. An input to a reducer is provided from all the mappers. An output of reducer is the final output, which is written on HDFS.