Alluxio Learning Center

Beginner to advanced topics on analytics, AI/ML, storage, and cloud concepts

Hybrid Cloud

A Guide to Cloud Bursting
Cloud bursting spreads computing load across the private and public infrastructure of a hybrid cloud. Find out how it works and more with Alluxio.

Presto

Introduction to Presto and commonly asked questions
Presto was originally designed at Facebook to run fast, interactive queries against Hadoop data warehouses storing petabytes of data.

An Introduction to Presto architecture
A typical Presto deployment will include one Presto Coordinator and any number of Presto Workers. In practice, you might deploy Presto in the cloud or on-prem.

Presto and Hadoop

What is a query engine?
What is a query engine, and more specifically, a SQL query engine? Learn about the benefits of using one, along with examples.

Spark

Introduction to Apache Spark and commonly asked questions
Apache Spark is an open-source analytics framework for big data, AI, and machine learning, best suited to large-scale data processing.

An Introduction to the Apache Spark architecture
Apache Spark includes Spark Core and four libraries: Spark SQL, MLlib, GraphX, and Spark Streaming. Individual applications will typically require Spark Core and at least one of these libraries.

EMR

Introduction to Amazon EMR and MapReduce
Amazon Elastic MapReduce (EMR) is a tool for processing and analyzing big data quickly. It uses tools like Spark, Hive, HBase, and Presto along with storage (like S3) and compute capacity (like EC2).

FAQ on Amazon EMR and EC2
The key differences between Amazon EMR and EC2, and how EMR works.

How to Use Presto on Amazon EMR
Amazon EMR provides scalable compute in the cloud, including interactive queries with Presto, for big data in S3 storage.

HDFS

Introduction to Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It is a distributed file system that provides high-throughput access to application data.

Basic HDFS File Operations Commands
Learn basic HDFS commands in Linux that let you create and list directories; move, delete, and read files; and more.