ADVANCED ANALYTICS & AI ON REMOTE DATA FOR HYBRID AND MULTI-CLOUD

Open Source Data Orchestration for the Cloud

Benchmark & Architecture Report
“Zero-Copy” Hybrid Cloud for Data Analytics

Get the report >

Looking for more compute capacity with remote data?
See how “Zero-Copy” hybrid bursting helps

Get the whitepaper >

Alluxio enables compute

Data Locality

Bring your data close to compute.
Make your data local to compute workloads for Spark caching, Presto caching, Hive caching, and more.
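As a hedged sketch of what this looks like for Spark (the app class and jar below are hypothetical): ship the Alluxio client jar with the job and read through an alluxio:// path, and once the first job pulls the data, the Alluxio workers co-located with compute serve repeat reads locally.

# Add the Alluxio client to the Spark classpath and read via alluxio://
$ spark-submit \
  --conf spark.driver.extraClassPath=/<ALLUXIO_HOME>/client/alluxio-<VERSION>-client.jar \
  --conf spark.executor.extraClassPath=/<ALLUXIO_HOME>/client/alluxio-<VERSION>-client.jar \
  --class MyApp my-app.jar alluxio://master:19998/data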

Data Accessibility

Make your data accessible.
Whether it sits on-prem or in the cloud, in HDFS or S3, make your files and objects accessible in many different ways.
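For instance, once stores are mounted into the Alluxio namespace (see the mount commands further down), the same paths work regardless of where the bytes live; the mount points and file below are illustrative:

# One namespace over many stores
$ ./bin/alluxio fs ls /
$ ./bin/alluxio fs cat /s3/report.csv    # object stored in S3
$ ./bin/alluxio fs cat /hdfs/report.csv  # file stored in HDFS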

Data On-Demand

Make your data as elastic as compute.
Effortlessly orchestrate your data for compute in any cloud, even if data is spread across multiple clouds.
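A minimal sketch, assuming credentials are configured as in the mount examples below (the dataset path is illustrative): mount stores from two clouds into one namespace, then warm a working set on demand.

$ ./bin/alluxio fs mount alluxio://master:port/s3 s3a://<S3_BUCKET>/<S3_DIRECTORY>
$ ./bin/alluxio fs mount alluxio://master:port/gcs gs://<GCS_BUCKET>/<GCS_DIRECTORY>
$ ./bin/alluxio fs load /s3/hot-dataset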

“Zero-Copy” Burst User Spotlight: Walmart

Why Walmart chose Alluxio’s “Zero-Copy” burst solution:

  • No requirement to persist data into the cloud
  • Improved query performance with no network hops on repeated queries
  • Lower costs with no need to create data copies

See more on how Alluxio powers Walmart’s “zero-copy” burst solution in their presentation >

Featured Use Cases and Deployments

Managing data copies and app changes when bursting compute to the cloud?

Zero-copy hybrid bursting intelligently makes remote data accessible to compute in the public cloud, with no app changes.

Expanding compute capacity across geo-distributed data centers?

Zero-copy bursting across data centers for Presto, Spark, and Hive on data stored in HDFS, with no app changes.

Interact with Alluxio in any stack

Pick a compute. Pick a storage. Alluxio just works.

Tutorial > Full Docs >

-- Pointing the schema location to Alluxio
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')

Full Docs

// Using Alluxio as input and output for RDDs
scala> val rdd = sc.textFile("alluxio://master:19998/Input")
scala> rdd.saveAsTextFile("alluxio://master:19998/Output")

// Using Alluxio as input and output for DataFrames
scala> val df = sqlContext.read.parquet("alluxio://master:19998/Input.parquet")
scala> df.write.parquet("alluxio://master:19998/Output.parquet")

Full Docs

-- Pointing Table location to Alluxio
hive> CREATE TABLE u_user (
userid INT,
age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 'alluxio://master:port/table_data';

Full Docs

# Creating and listing a table stored in Alluxio
hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> list 'test'

Full Docs

# Running a wordcount using Alluxio as input and output
$ bin/hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -libjars /<ALLUXIO_HOME>/client/alluxio-<VERSION>-client.jar \
  alluxio://master:19998/wordcount/input.txt \
  alluxio://master:19998/wordcount/output

Full Docs

# Accessing Alluxio after mounting Alluxio service to local file system
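# (Assumed setup step for this example) mount the Alluxio namespace to a
# local path first, using the FUSE integration shipped with Alluxio:
$ ./integration/fuse/bin/alluxio-fuse mount /mnt/alluxio_mount /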
$ ls /mnt/alluxio_mount
$ cat /mnt/alluxio_mount/mydata.txt

Full Docs

$ ./bin/alluxio fs mount \
--option aws.accessKeyId=<AWS_ACCESS_KEY_ID> \
--option aws.secretKey=<AWS_SECRET_KEY_ID> \
alluxio://master:port/s3 s3a://<S3_BUCKET>/<S3_DIRECTORY>

Full Docs

$ ./bin/alluxio fs mount \
alluxio://master:port/hdfs hdfs://namenode:port/dir/

Full Docs

$ ./bin/alluxio fs mount \
--option fs.azure.account.key.<AZURE_ACCOUNT>.blob.core.windows.net=<AZURE_ACCESS_KEY> \
alluxio://master:port/azure \
wasb://<AZURE_CONTAINER>@<AZURE_ACCOUNT>.blob.core.windows.net/<AZURE_DIRECTORY>/

Full Docs

$ ./bin/alluxio fs mount \
--option fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID> \
--option fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY> \
alluxio://master:port/gcs gs://<GCS_BUCKET>/<GCS_DIRECTORY>

Full Docs

$ ./bin/alluxio fs mount \
--option aws.accessKeyId=<AWS_ACCESS_KEY_ID> \
--option aws.secretKey=<AWS_SECRET_KEY_ID> \
--option alluxio.underfs.s3.endpoint=http://<rgw-hostname>:<rgw-port> \
--option alluxio.underfs.s3.disable.dns.buckets=true \
alluxio://master:port/ceph s3a://<S3_BUCKET>/<S3_DIRECTORY>

Full Docs

$ ./bin/alluxio fs mount alluxio://master:port/nfs /mnt/nfs

Full Docs


Powered by Alluxio

What’s Happening

Event
Build a hybrid data lake and burst processing to Google Cloud Dataproc with Alluxio

Join us for this tech talk, where we will show how Alluxio can help burst your private computing environment to Google Cloud while minimizing costs and I/O overhead. Alluxio coupled with Dataproc, Google Cloud's managed service for open source data and analytics processing, enables zero-copy bursting for faster query performance in the cloud, so you can take advantage of resources that are not local to your data without managing copies or syncs of that data.
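As a rough sketch of the setup, assuming the community-maintained Dataproc initialization action for Alluxio (the cluster name, region, and under-store URI below are placeholders):

# Create a Dataproc cluster with Alluxio installed, pointing the Alluxio
# root at a remote store so cloud compute can read it without copies
$ gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://goog-dataproc-initialization-actions-<REGION>/alluxio/alluxio.sh \
  --metadata alluxio_root_ufs_uri=hdfs://<namenode>:<port>/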

Alluxio Tech Talk
Blog
Efficient Model Training in the Cloud with Kubernetes, TensorFlow, and Alluxio

A collaboration between Alibaba, Alluxio, and Nanjing University tackling the problems of deep learning model training in the cloud. The goal was to reduce the cost and complexity of data access for deep learning training in a hybrid environment, and the result was over a 40% reduction in training time and cost.

Press Release
Alluxio Recognized by CRN, Data Breakthrough Awards and InsideBIGDATA

Alluxio, the developer of open source cloud data orchestration software, today announced it has been named to the Computer Reseller News (CRN) Big Data 100 list – “The Coolest Data Management and Integration Tool Companies,” chosen a 2020 Data Breakthrough Awards “Best Data Access Solution of the Year” winner, and awarded an honorable mention on InsideBIGDATA “IMPACT 50 List for Q2 2020.”

Blog
What’s new in Alluxio 2.2

With this release comes the General Availability (GA) of Alluxio Structured Data Services (SDS), the subsystem of Alluxio responsible for managing and transforming structured data, such as databases, tables, and partitions.