Hybrid Cloud Analytics: Scaling analytics workloads on on-premise to public clouds with Alluxio

This whitepaper details how to use any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data, without copying and synchronizing that data into the cloud. It walks through an example of running on-demand Starburst Presto, Spark, and Hive with Alluxio in the public cloud against on-prem HDFS.

The paper also includes a real-world case study of Two Sigma, a leading hedge fund based in New York City, which deployed large clusters of Google Compute Engine VMs with Spark and Alluxio using on-prem HDFS as the underlying storage tier.

Designing for the cloud: What today’s Data Engineer should be considering when building their stack

The cloud has changed the dynamics of data engineering in many ways, from new expectations of on-demand platform services, to the popularity of the object store, to the emergence of a flexible, separated data stack. For a data engineer venturing into this cloudy world, an understanding of specific architectural approaches, coupled with working knowledge of a few data stacks, has proven useful.

Instead of being purely focused on data infrastructure, today’s data engineer is now a full stack engineer. Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack. 

This white paper discusses some design principles, as well as the high-priority elements of the stack, that a data engineer should think about.

Accelerating analytics on AWS EMR & AWS S3 with Alluxio in a disaggregated data stack

The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand. 

Together, EMR and S3 provide a robust yet flexible platform in the cloud with just a few clicks, in contrast to the highly complex and rigid deployment approach required for on-premise Hadoop data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.

In this white paper, we’ll share some of the challenges that arise from the impedance mismatch between HDFS and S3, what analytics workloads expect from an object store, and how Alluxio with EMR addresses these challenges.

Datasheet: What is Alluxio?

Get the Alluxio datasheet to learn more about open source data orchestration for big data and machine learning in the cloud. Proven at global web scale in production for modern data services, Alluxio is the developer of open source data orchestration software for the cloud. Alluxio moves data closer to big data and machine learning … Continued

Achieving 10x acceleration of Spark and Hive Jobs on AWS S3 with Alluxio Tiered Storage

The data engineering team at Bazaarvoice, a software-as-a-service digital marketing company based in Austin, Texas, must handle data at massive Internet scale to serve its customers. Facing challenges with scaling up storage capacity and provisioning hardware, they turned to Alluxio’s tiered storage system and saw 10x acceleration of their Spark and Hive jobs running on AWS S3.

In this whitepaper you’ll learn:

  • How to build a big data analytics platform on AWS that includes technologies like Hive, Spark, Kafka, Storm, Cassandra, and more
  • How to set up a Hive metastore using a storage tier for hot tables
  • How to leverage tiered storage for maximized read performance
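
As a rough illustration of why tiered storage helps reads, the sketch below models a cache made of ordered tiers, fastest first. It is a toy model under stated assumptions, not Alluxio's implementation: the class name TieredStore, the tier names, the least-recently-used eviction, and the promote-on-read policy are all hypothetical choices made for this example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered cache (hypothetical, not Alluxio's code).

    Tiers are ordered fastest first. Each tier holds a fixed number of
    blocks and evicts its least recently used block to the next tier.
    """

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _put(self, level, key, value):
        if level >= len(self.tiers):
            return  # evicted past the slowest tier: no longer cached
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)  # mark as most recently used
        if len(blocks) > cap:
            old_key, old_value = blocks.popitem(last=False)
            self._put(level + 1, old_key, old_value)  # demote one tier down

    def read(self, key, load_from_under_store):
        """Return (value, where_it_was_found)."""
        for name, _, blocks in self.tiers:
            if key in blocks:
                value = blocks.pop(key)
                self._put(0, key, value)  # promote hot data to the top tier
                return value, name
        # Cache miss: fetch from the slow under store (for example S3)
        value = load_from_under_store(key)
        self._put(0, key, value)
        return value, "UNDER_STORE"


if __name__ == "__main__":
    store = TieredStore([("MEM", 1), ("SSD", 1)])
    fetch = lambda key: "data-for-" + key  # stand-in for a remote read
    for key in ["a", "b", "a"]:
        print(key, store.read(key, fetch)[1])
```

In this model, repeated reads of hot data are served from the fastest tier that holds it, and only blocks pushed out of the slowest tier need to be fetched from the under store again.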

Effective caching of Spark Resilient Distributed Datasets (RDDs) with Alluxio

Organizations like Baidu and Barclays have deployed Alluxio with Spark in their architecture, and have achieved impressive benefits and gains. Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their blog post, they described how Alluxio improved their system architecture, and … Continued

Testing Distributed Systems in the Big Data Ecosystem at 1000+ node Scale

Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio we take testing very seriously, because organizations across the world rely on our technology. The problem we want to solve is how to test at scale without breaking the bank. In this blog we show how the maintainers of the Alluxio open source project build and test the system at scale, cost-effectively, using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and with pervasive storage systems, such as HDFS and S3. Using AWS EC2, we are able to test clusters of 1000+ workers at a cost of about $16 per hour.
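
To put the figures above in per-node terms, here is a small back-of-the-envelope sketch. The function names and the three-hour run length are hypothetical, and the inputs are the approximate numbers quoted above (about $16 per hour, at least 1000 workers), so the results are rough estimates only.

```python
def cost_per_worker_hour(cluster_cost_per_hour, workers):
    """Average hourly cost of one worker in the test cluster."""
    return cluster_cost_per_hour / workers


def run_cost(cluster_cost_per_hour, hours):
    """Total cost of keeping the whole cluster up for a test run."""
    return cluster_cost_per_hour * hours


if __name__ == "__main__":
    # Approximate figures quoted in the text: about $16/hour, 1000+ workers
    per_worker = cost_per_worker_hour(16.0, 1000)
    print(f"about ${per_worker:.3f} per worker-hour at 1000 workers")
    # Hypothetical three-hour test run
    print(f"about ${run_cost(16.0, 3):.2f} for a three-hour run")
```

Because the cluster has at least 1000 workers, the per-worker figure computed for exactly 1000 workers is an upper bound on the average cost per worker.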

Providing a Unified Data Layer at Memory Speed for Cloud Environments with Huawei and Alluxio

The cloud is rapidly becoming ubiquitous, with continued adoption focused on the flexibility and cost benefits of a utility infrastructure model. Enterprises are increasingly taking a “data first” view of infrastructure, which demands a new way of thinking in a world in which data is stored and accessed from multiple locations and providers. Performance and interoperability challenges, however, can present obstacles to cloud adoption and complicate data management. Techniques such as the use of data silos, ETL processes and multiple data copies, which are commonly employed to accommodate cloud limitations, often tend to offset the expected benefits of cloud infrastructure. Alluxio offers a new way to enhance the benefits of cloud infrastructure without the performance limitations or interoperability challenges resulting from accessing disparate data sources in multiple, often remote, locations.
