Alluxio is meant to be an abstraction layer that sits on top of storage systems. It is designed on the assumption that a storage layer exists underneath it, so using it as just another storage system does not solve the problem of having storage and compute co-located. Alluxio allows you to have long-running data … Continued
Haoyuan Li’s keynote at O’Reilly Beijing discusses open source data orchestration and the value of Alluxio as rising trends drive the need for a new architecture. Four big trends are driving this need: the separation of compute and storage, hybrid multi-cloud environments, the rise of object stores, and self-service data across the enterprise.
The data orchestration layer bridges the gap between data locality and data accessibility for analytics workloads in Kubernetes, improving performance and enabling portability across storage providers.
An overview of Alluxio and the cloud use case with Spark in Kubernetes. Learn how to set up Alluxio and Spark to run in Kubernetes.
Haoyuan Li presents at the Beijing Meetup on open source data orchestration and the value of Alluxio as rising trends drive the need for a new architecture. Four big trends are driving this need: the separation of compute and storage, hybrid multi-cloud environments, the rise of object stores, and self-service data across the enterprise.
As the data ecosystem becomes massively complex and more and more disaggregated, data analysts and end users have trouble adapting and working with hybrid environments. The proliferation of compute applications along with storage mediums leads to a hybrid model that we are just not accustomed to.
With this disaggregated system, data engineers now face a multitude of problems that they must overcome to get meaningful insights.
Twitter SF is hosting 2019’s half-yearly RocksDB Meetup on July 11th, with speakers from Twitter, Facebook, and the community.
Join us June 24 in Menlo Park for our next meetup! We’ll have 3 valuable talks, a delicious BBQ dinner and amazing summertime-themed raffle prizes! This free event is sponsored by GridGain Systems and Oracle.
Traditionally, if you want to run a single Spark job on EMR, you might follow these steps: launch a cluster; run the job, which reads data from a storage layer such as S3 and performs transformations on an RDD, DataFrame, or Dataset; and finally send the result back to S3. You end up with something like this.
If you add more Spark jobs across multiple clusters, you could end up with something like this.
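The single-job flow described above (read from S3, transform, write the result back to S3) can be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the bucket paths, the `user_id` column, and the choice of a group-and-count transformation are hypothetical and do not come from the original post.

```python
def run_job(spark, input_path, output_path):
    """Read from object storage, transform, and write the result back."""
    # Step 1: read raw records from the storage layer (e.g. S3).
    df = spark.read.json(input_path)

    # Step 2: transform. Counting rows per "user_id" is a hypothetical
    # example; a real job would apply its own logic here.
    result = df.groupBy("user_id").count()

    # Step 3: send the result back to the storage layer.
    result.write.mode("overwrite").parquet(output_path)
    return result


def main():
    # Imported here so the module can be loaded without PySpark installed;
    # on an EMR cluster PySpark is assumed to be available.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example-emr-job").getOrCreate()
    try:
        # Hypothetical bucket and prefixes, for illustration only.
        run_job(
            spark,
            "s3://example-bucket/input/",
            "s3://example-bucket/output/",
        )
    finally:
        spark.stop()
```

To use it as a script, add a call to `main()` at the bottom and submit it to the cluster. Every job written this way reads its input from S3 and writes its output back to S3, which is the pattern that multiplies when more jobs and clusters are added.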
As the data ecosystem within enterprises grows larger and larger, we see an increase not only in total data volume but also in the number of disparate storage systems in which that data is housed. The challenge then becomes how different applications and teams can efficiently access data … Continued