Introduction to Apache Spark and commonly asked questions

Spark with Alluxio brings together two open source technologies. Alluxio’s data caching layer gives you better performance, and it enables hybrid cloud environments in which Spark jobs run in the cloud while the data stays on-prem. Alluxio brings data locality back to Spark’s distributed analytics engine in disaggregated environments and provides an intelligent and highly available data tier for Spark.

What is Apache Spark, what is it used for in big data, and why do you need it?

Apache Spark is an open source analytics framework for big data, AI, and machine learning developed out of the UC Berkeley AMPLab. It’s best used for large-scale data processing. Specifically, developers use Spark to run workloads faster, to quickly write applications in Java, Scala, Python, R, and SQL, and to combine all types of analytics in the same application. Companies like Databricks offer enterprise versions of Apache Spark.

The top use cases for Spark are streaming (analyzing data in real time), machine learning (running repeated queries on sets of data), and interactive analytics (running queries without sampling).

How do you use Spark with Hadoop? Is it Spark vs Hadoop?

HDFS (the Hadoop Distributed File System) is the primary data storage system under Hadoop applications. It is a distributed file system and provides high-throughput access to application data. It’s part of the big data landscape and provides a way to manage large amounts of structured and unstructured data. Hadoop as a whole distributes the storage and processing of large data sets over clusters of inexpensive computers.

You can store and process data with Hadoop in a distributed environment, across clusters of computers, using simple programming constructs. Hadoop’s MapReduce divides a job into small tasks and assigns them to a set of computers. Spark is similar to MapReduce, but Spark’s data processing engine has been shown to be faster and easier to use than MapReduce, thanks to its in-memory processing and stream processing.
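
The divide-and-combine idea behind MapReduce can be sketched in a few lines of plain Python. This is only a toy illustration on one machine, not Hadoop's or Spark's API; the function names and the sample lines are made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, num_chunks):
    """Divide the input into parts, one per (hypothetical) worker."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map phase: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce phase: merge two partial results into one."""
    return left + right


# Made-up sample input for the sketch.
lines = ["spark and hadoop", "hadoop stores data", "spark processes data"]

chunks = split_into_chunks(lines, num_chunks=2)
# On a real cluster each chunk would be processed on a different machine.
partials = [map_chunk(chunk) for chunk in chunks]
word_counts = reduce(reduce_counts, partials, Counter())

print(dict(word_counts))
```

Each chunk is processed without knowledge of the others, which is what makes it possible to spread the map phase across many computers.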

Spark can run with Hadoop on-prem or in the cloud, using HDFS as the distributed file system. Combining the two gives you Hadoop’s low-cost operation on commodity hardware for disk-heavy operations, together with Spark’s more costly in-memory processing architecture for high processing speed, advanced analytics, and support for multiple integrations.

Is Spark built on top of Hadoop?

Spark can be deployed on top of Hadoop, but it’s not necessarily “built” on top of Hadoop. Spark works with many other storage systems as well, including AWS S3, HBase, and more. Many companies deploy Spark with Hadoop because one enhances the other. When it is coupled with Hadoop, Spark reads and writes data to and from Hadoop and gets better processing capabilities.

Why is Spark so fast?

One of the big advantages of Spark is that it processes data in the main memory of worker nodes and avoids unnecessary disk I/O. This in-memory caching abstraction makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don’t need to be read from disk for each operation.
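
The effect of caching can be sketched with a small pure-Python toy. It is not Spark code: `ToyDataset`, `read_from_disk`, and `stats` are hypothetical names, and the "disk read" is simulated with a counter. In Spark itself, the corresponding request is made by calling `cache()` or `persist()` on an RDD or DataFrame.

```python
stats = {"disk_reads": 0}


def read_from_disk():
    """Simulates an expensive read of the input data set."""
    stats["disk_reads"] += 1
    return [3, 1, 4, 1, 5, 9, 2, 6]


class ToyDataset:
    """Toy stand-in for an input data set; NOT a Spark class."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._cached = None
        self._use_cache = False

    def cache(self):
        """Ask for the data to be kept in memory after the first read."""
        self._use_cache = True
        return self

    def _data(self):
        if not self._use_cache:
            return self._load_fn()  # re-read for every operation
        if self._cached is None:
            self._cached = self._load_fn()  # read once, then reuse
        return self._cached

    def total(self):
        return sum(self._data())

    def maximum(self):
        return max(self._data())


# Two operations without caching: the input is read twice.
uncached = ToyDataset(read_from_disk)
uncached.total()
uncached.maximum()
reads_without_cache = stats["disk_reads"]

# The same two operations with caching: the input is read once.
stats["disk_reads"] = 0
cached = ToyDataset(read_from_disk).cache()
cached.total()
cached.maximum()
reads_with_cache = stats["disk_reads"]

print(reads_without_cache, reads_with_cache)
```

The more operations share the same input, the more reads the cache saves, which is why iterative workloads benefit most.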

Spark is generally much faster than MapReduce, especially for jobs with multiple steps. That’s because MapReduce persists full datasets to HDFS after running each job, while Spark can pass data directly from one step to the next without writing it to persistent storage. Spark can also launch tasks faster than MapReduce.

What are the features of Apache Spark?

Apache Spark includes:

  • Spark Context: This is located in the Master Node’s driver program. Spark Context is a gateway to all Spark functionality. It is similar to a database connection: any command you execute in your database goes through the database connection, and likewise anything you do in Spark goes through Spark Context.
  • Cluster Manager: This allocates cluster resources to the jobs. The driver program and Spark Context take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes. Any time an RDD is created in Spark Context, it can be distributed across various nodes and cached there.
  • Executors/Workers: Executors are processes on the worker nodes that execute the tasks. The tasks run on the partitioned RDDs on the worker nodes, and the results are returned to Spark Context.
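
The way a job is split into tasks and handed to workers can be sketched on a single machine with a thread pool. This is a toy model of the idea only: `run_job` plays the role of the driver and the pool's threads stand in for executors. None of these names come from Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """One task: process a single partition and return a partial result."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions, num_workers):
    """Toy 'driver': split the data into partitions, give one task per
    partition to a pool of workers, then combine the partial results."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the partitions.
        partial_results = list(pool.map(run_task, partitions))
    return partitions, partial_results, sum(partial_results)


partitions, partial_results, total = run_job(
    list(range(1, 11)), num_partitions=4, num_workers=2
)
print(partitions)
print(partial_results, total)
```

There are more partitions than workers here on purpose: each worker picks up a new task as soon as it finishes the previous one.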

RDDs (Resilient Distributed Datasets) are the building blocks of any Spark application. The name breaks down as follows:

  • Resilient: Fault tolerant and capable of rebuilding data on failure
  • Distributed: Data is distributed among multiple nodes in a cluster
  • Dataset: A collection of partitioned data with values

The data in an RDD is split into partitions; for key-value data, the partitioning can be based on the key. RDDs are highly resilient: each RDD keeps track of the lineage of transformations used to build it, so if one executor node fails, the lost partitions can be recomputed on another node. Cached data can also be replicated across nodes if you choose a replicated storage level. Partitioning also lets you perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes.
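
Spark rebuilds a lost RDD partition by recomputing it from its lineage, that is, from the recorded transformations that produced it. The pure-Python toy below sketches that idea under simplifying assumptions: `ToyRDD` and its methods are hypothetical, everything runs in one process, and the source data is assumed to be durable.

```python
class ToyRDD:
    """Toy model of a partitioned dataset that remembers its lineage.
    NOT Spark's RDD class."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # durable input, one list per partition
        self.lineage = tuple(lineage)               # functions applied so far, in order
        self.materialized = {}                      # partition index -> computed data

    def map(self, fn):
        """Record a transformation and return a new ToyRDD with a longer lineage."""
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        """Build one partition by replaying the lineage over its source data."""
        data = self.source_partitions[index]
        for fn in self.lineage:
            data = [fn(x) for x in data]
        self.materialized[index] = data
        return data

    def get_partition(self, index):
        """Return the partition, recomputing it if it is missing."""
        if index not in self.materialized:
            return self.compute_partition(index)
        return self.materialized[index]

    def lose_partition(self, index):
        """Simulate an executor failure that drops one computed partition."""
        self.materialized.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = [rdd.get_partition(i) for i in range(3)]

rdd.lose_partition(1)             # partition 1 is gone
recovered = rdd.get_partition(1)  # rebuilt from the source data and the lineage

print(before)
print(recovered)
```

Only the lost partition is recomputed; the other partitions are left untouched.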

Can you run Spark without Hadoop?

Yes. Spark is not tied to Hadoop’s HDFS and works with many other storage systems like AWS S3, HBase, and more.

What is an action in Spark?

An action in Spark is one of the two types of RDD operations, the other being a transformation. A transformation produces a new RDD from existing RDDs and is evaluated lazily: Spark only records it. An action triggers the actual computation and returns a result to the driver program or writes it to storage. Unlike a transformation, an action does not produce a new RDD.
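
The difference between the two kinds of operations can be sketched with a small pure-Python toy. It is not Spark's API or implementation, and `LazyDataset`, `double`, and `calls` are hypothetical names for illustration. In Spark's RDD API, `map` and `filter` are transformations, while `collect` and `count` are actions.

```python
class LazyDataset:
    """Toy model of lazy evaluation; NOT Spark's RDD class."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new LazyDataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Replay the recorded steps over the source data."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a plain value, not a dataset.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)  # nothing has been computed so far

result = evens_doubled.collect()  # the action triggers the computation
print(calls_before_action, result)
```

The transformations return another `LazyDataset`, while `collect` returns an ordinary Python list, which mirrors the rule that an action does not form a new RDD.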

What are the programming languages that Apache Spark supports?

Apache Spark supports Java, Scala, Python, R, and SQL.

Additional Resources

An Introduction to the Apache Spark Architecture