Frequently Asked Questions about Amazon EMR and EC2

What is Amazon Elastic MapReduce?

Amazon Elastic MapReduce (Amazon EMR) lets users bring up a cluster with a fully integrated analytics and data pipelining stack in a matter of minutes.

What are the benefits of Amazon Elastic MapReduce?

Installing and configuring software natively on hardware can take hours or even days. Amazon EMR instead brings up a cluster with the required data frameworks in a matter of minutes.
Clusters can be brought up when needed and taken down when the jobs complete, which saves costs and gives data engineering teams considerable flexibility.

What is the difference between AWS EMR and EC2?

Amazon Elastic Compute Cloud (Amazon EC2) is a service that provides computational resources in the cloud. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. 

Amazon Elastic MapReduce (EMR), on the other hand, is a cloud service focused specifically on analytics that runs on top of EC2 instances. It comes with the Hadoop stack installed, and users can add services such as Spark, Presto, and Hive as needed for the analytics they want to run. An EMR cluster costs slightly more than the same EC2 instances on their own, because Amazon EMR adds its own per-instance charge on top of the EC2 price.

How does AWS EMR work?

The Amazon EMR service consists of several components: storage, computational frameworks, and cluster resource management.

Storage in EMR cluster

There are several different options for storing data in an EMR cluster:

  1. Hadoop Distributed File System (HDFS)
    Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster, so data has to be copied into and out of the cluster explicitly. It is not synced automatically with Amazon S3; tools such as DistCp (or S3DistCp on Amazon EMR) are needed for the copy.
  2. EMR File System (EMRFS)
    Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster.
  3. Local File System
    The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
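
As a rough illustration of the three storage options above, a path's URI scheme usually indicates which layer a job is reading from or writing to: hdfs:// for HDFS, s3:// for EMRFS, and file:// for the local file system. The sketch below only classifies a path by its scheme. The helper name storage_layer and the example paths are hypothetical and not part of any EMR API.

```python
from urllib.parse import urlparse

# URI scheme -> the EMR storage layer it usually denotes.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS (ephemeral, cluster-local)",
    "s3": "EMRFS (Amazon S3)",
    "file": "local file system (instance store)",
}


def storage_layer(uri: str) -> str:
    """Return a label for the storage layer a path points at (hypothetical helper)."""
    scheme = urlparse(uri).scheme
    return STORAGE_BY_SCHEME.get(scheme, "unknown")


# Example paths are made up for illustration.
for path in ("hdfs:///user/hadoop/input",
             "s3://example-bucket/logs/",
             "file:///mnt/scratch/tmp"):
    print(path, "->", storage_layer(path))
```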

Computational Frameworks

EMR provides several different frameworks for analysis, with MapReduce, Spark, Presto, and Hive being the most popular. Different frameworks are available for different kinds of processing needs, such as batch, interactive, in-memory, streaming, and so on.

  1. Hadoop MapReduce
    Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the parallelization logic, while you provide the Map and Reduce functions. Hive is a popular framework that runs on top of MapReduce: it generates the Map and Reduce programs automatically and uses MapReduce as an execution engine.
  2. Apache Spark
    Spark is a cluster framework and programming model for processing big data workloads. Like Hadoop MapReduce, Spark is an open-source, distributed processing system but uses directed acyclic graphs for execution plans and in-memory caching for datasets.
  3. Presto
    Presto is a popular query engine for interactive analysis on structured data. It works either on HDFS or S3 on a range of file formats like Parquet, ORC and others.

Cluster Resource Management

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data. By default, Amazon EMR uses YARN. Amazon EMR also has an agent on each node that administers YARN components, keeps the cluster healthy, and communicates with Amazon EMR.

Does Amazon EMR use HDFS?

Yes, an Amazon EMR cluster comes with Hadoop installed, and Hadoop includes the HDFS storage system. Users can store data in HDFS. They can also use Amazon S3 or the local disks that come with the instances in the cluster.

What elastic infrastructure does AWS EMR use?

When you request an EMR cluster, you pick a specific EC2 instance type. Amazon EMR then launches EC2 instances of that type to form the cluster.
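
A minimal sketch of what such a request can look like, written as the keyword arguments that the boto3 EMR client's run_job_flow call accepts. The function name build_cluster_request, the cluster name, the release label, the instance type, and the use of the default role names are all assumptions chosen for illustration. Nothing is sent to AWS unless the commented-out call at the end is used.

```python
def build_cluster_request(name, instance_type, core_count, applications):
    """Build run_job_flow keyword arguments (hypothetical helper, illustrative values)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label; pick one available to you
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Terminate the cluster once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("example-cluster", "m5.xlarge", 2, ["Spark", "Hive"])

# With boto3 installed and AWS credentials configured, the request could be
# submitted like this (left commented out so the sketch runs offline):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Every instance group in this sketch uses the one selected instance type, which mirrors the description above.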

Which are the most used open source apps in AWS EMR?

The most commonly used open source frameworks on Amazon EMR are Apache Spark, Hive, and Presto. These are used for batch and interactive analysis as well as data transformation.

What does AWS EMR stand for?

AWS EMR stands for Amazon Elastic MapReduce, an Amazon Web Services (AWS) tool for big data processing and analysis.

How does MapReduce work?

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
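
The map, sort, and reduce phases described in this answer can be sketched in a few lines of plain Python. This is a single-process toy word count rather than Hadoop code: the function names and the sample chunks are made up for illustration, and a real framework would run the map calls in parallel across nodes and handle scheduling and retries itself.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map function: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce function: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(chunks):
    # Map phase: chunks are independent, so they could be processed in parallel.
    mapped = [pair for chunk in chunks for line in chunk for pair in map_words(line)]
    # Sort phase: order the map output by key so that equal keys sit together.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    return [reduce_counts(key, (count for _, count in group))
            for key, group in groupby(mapped, key=itemgetter(0))]


# Two input chunks, as if the input had been split across two map tasks.
chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
result = run_job(chunks)
print(result)
```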

What is MapReduce used for?

MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, Singular Value Decomposition, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.

Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems, desktop grids, multi-cluster, volunteer computing environments, dynamic cloud environments, mobile environments, and high-performance computing environments.

What is a MapReduce key-value pair?

A key-value pair is the record entity that Hadoop MapReduce accepts for execution. The MapReduce framework operates exclusively on key-value pairs: it views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.
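
To make the "different types" point concrete, here is a small plain-Python sketch in which the pair types change at each stage: the input pairs are (byte offset, text line), the map output pairs are (city, temperature), and the job output pairs are (city, average temperature). The record layout, the function names, and the data are invented for illustration. This is not Hadoop's API.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_record(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Input pair (int offset, str line) -> intermediate pair (str city, int temp)."""
    city, temp = line.split(",")
    yield city.strip(), int(temp)


def reduce_city(city: str, temps: Iterable[int]) -> Tuple[str, float]:
    """Intermediate pairs for one key -> output pair (str city, float average)."""
    values = list(temps)
    return city, sum(values) / len(values)


# Input as (byte offset, line) pairs; the offsets match the line lengths.
records = [(0, "Oslo,4"), (7, "Cairo,30"), (16, "Oslo,8")]

# Group intermediate values by key, standing in for the framework's sort step.
grouped = defaultdict(list)
for offset, line in records:
    for city, temp in map_record(offset, line):
        grouped[city].append(temp)

averages = dict(reduce_city(city, temps) for city, temps in grouped.items())
print(averages)
```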

Additional Resources

Introduction to Amazon EMR and MapReduce