We introduce Tachyon, a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.
Tag: data engineering
Welcome to the first event of the Cloud, Data, & Orchestration Austin Meetup! This meetup will feature two talks and an opportunity to engage with other data engineers, developers, and Alluxio users. Thanks to Bazaarvoice for hosting!
Today, real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and maintain the jobs’ consistency during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to dramatically reduce the workload pressure on our HDFS NameNode.
In this meetup, Dipti and HY will present a new approach to hybrid analytical workloads using Alluxio, an open source data orchestration layer that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance, using the data locality and abstraction that the data orchestration tier brings.
The cloud has changed the dynamics of data engineering in many ways, from changing expectations of on-demand platform services to the popularity of the object store to the emergence of a flexible, separated data stack. For a data engineer venturing into this cloudy world, an understanding of specific architectural approaches, coupled with knowledge of some data stacks, has proven useful.
Instead of being purely focused on data infrastructure, today’s data engineer is now a full stack engineer. Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack.
This white paper discusses some design principles as well as high-priority elements of the stack that a data engineer should think about.
Tags: data engineering
The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because an on-premises data engineer dealt only with databases and some parts of the Hadoop stack.
In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full stack engineer (leaving out the final end application, perhaps). Compute, containers, storage, data movement, performance, network: skills are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.
Over years of working in the big data and machine learning space, we have frequently heard from data engineers that the biggest obstacle to extracting value from data is being able to access the data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit or public enemy №1. There have been many attempts to do away with data silos, but those attempts have themselves resulted in yet another data silo, with data lakes being one such example. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.
Alluxio is an open source software solution that connects analytics applications to heterogeneous data sources through a data orchestration layer that sits between compute and storage.
MesosCon Europe 2017 – Gene Pang discusses the architecture of Mesos, Spark and Alluxio to achieve an optimal architecture for enterprises.