Today, real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and keep those jobs consistent during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to dramatically reduce the workload pressure on our HDFS NameNode.
Tag: data engineering
In this meetup, Dipti and HY will present a new approach to hybrid analytical workloads using Alluxio, an open source data orchestration layer that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance, using the data locality and abstraction that the data orchestration tier brings.
The cloud has changed the dynamics of data engineering in many ways, from changing expectations of on-demand platform services to the popularity of the object store to the emergence of a flexible, separated data stack. For a data engineer venturing into this cloudy world, an understanding of specific architectural approaches, coupled with knowledge of some data stacks, has proven useful.
Instead of being purely focused on data infrastructure, today’s data engineer is now a full stack engineer. Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack.
This white paper discusses some design principles as well as high-priority elements of the stack that a data engineer should think about.
Tags: data engineering
The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because a data engineer on premises dealt only with databases and some parts of the Hadoop stack.
In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full stack engineer (leaving out the final end application, perhaps). Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.
Over our years of working in the big data and machine learning space, we have frequently heard from data engineers that the biggest obstacle to extracting value from data is being able to access the data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit, or public enemy No. 1. There have been many attempts to do away with data silos, but those attempts have themselves resulted in yet another data silo, with data lakes being one such example. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.
Alluxio is an open source software solution that connects analytics applications to heterogeneous data sources through a data orchestration layer that sits between compute and storage.
MesosCon Europe 2017 – Gene Pang discusses how Mesos, Spark, and Alluxio fit together to achieve an optimal architecture for enterprises.
Strata Data Conference London 2017 – Learn about stream processing on Alluxio from real-world workloads at Qunar, as well as how to position Alluxio in a streaming architecture.
Joint webinar – Mesosphere DC/OS is a production-proven platform that powers both modern app components – containers and data services – so businesses can accelerate time to market with confidence, and save. We have seen tremendous interest from users in running Alluxio on DC/OS.