Alluxio Learning Center

Beginner to advanced topics on analytics, AI/ML, storage, and cloud concepts

Presto

Introduction to Presto and commonly asked questions
Presto was originally designed at Facebook to run fast, interactive queries against large Hadoop data warehouses storing petabytes of data.

An Introduction to Presto architecture
A typical Presto deployment will include one Presto Coordinator and any number of Presto Workers. In practice, you might deploy Presto in the cloud or on-prem.

Spark

Introduction to Apache Spark and commonly asked questions
Apache Spark is an open-source analytics framework for big data, AI, and machine learning, best used for large-scale data processing.

An Introduction to the Apache Spark architecture
Apache Spark includes Spark Core and four libraries: Spark SQL, MLlib, GraphX, and Spark Streaming. Individual applications will typically require Spark Core and at least one of these libraries.

EMR

Introduction to Amazon EMR and MapReduce
Amazon Elastic MapReduce (EMR) is a tool for processing and analyzing big data quickly, using tools like Spark, Hive, HBase, and Presto along with storage (like S3) and compute capacity (like EC2).

FAQ on Amazon EMR and EC2
The key differences between Amazon EMR and EC2, and how EMR works.

HDFS

Introduction to Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the primary data storage system under Hadoop applications. It is a distributed file system and provides high-throughput access to application data.