data engineering Archives

The Practice of Alluxio in Ctrip Real-Time Computing Platform

July 19, 2019 By Jianhua Guo

Today, real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and keep those jobs consistent during downtime of our internal data lake (HDFS). In addition, we use Alluxio as a caching layer to dramatically reduce the workload pressure on our HDFS NameNode.

Accelerating Analytical Workloads for Public & Hybrid Clouds

New York Meetup, July 10, 2019

In this meetup, Dipti and HY will present a new approach to hybrid analytical workloads using Alluxio, an open source data orchestration layer that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance, thanks to the data locality and abstraction that the data orchestration tier provides.

Designing for the cloud: What today’s Data Engineer should be considering when building their stack

June 28, 2019

Cloud has changed the dynamics of data engineering in many ways, from new expectations of on-demand platform services, to the popularity of the object store, to the emergence of a flexible, separated data stack. For a data engineer venturing into this cloudy world, an understanding of specific architectural approaches, coupled with knowledge of some data stacks, has proven useful.

Instead of being purely focused on data infrastructure, today’s data engineer is now a full stack engineer. Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack.

This white paper discusses some design principles, as well as high-priority elements of the stack, that a data engineer should think about.

Tags: data engineering

Embracing Data Silos — the journey through a fragmented data world

June 21, 2019 By Amelia Wong and Bin Fan

Over our years of working in the big data and machine learning space, we have frequently heard from data engineers that the biggest obstacle to extracting value from data is accessing that data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit, or public enemy No. 1. There have been many attempts to do away with data silos, but those attempts have themselves resulted in yet more data silos, with data lakes being one such example. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.

Alluxio Overview: Unify Data at Memory Speed

September 14, 2018 by Haoyuan Li & Bin Fan

Alluxio is an open source software solution that connects analytics applications to heterogeneous data sources through a data orchestration layer that sits between compute and storage.

Tags: alluxio engineering, big data, compute storage separation, data, data engineering, data orchestration, overview, storage, unified namespace

Accelerating Spark Workloads in a Mesos Environment

October 26, 2017 by Gene Pang

MesosCon Europe 2017 – Gene Pang discusses how Mesos, Spark, and Alluxio can be combined into an architecture well suited to enterprises.

Tags: alluxio engineering, apache spark, architecture, aws s3, ceph, compute, conference, data, data engineering, Google Cloud Storage, hdfs, spark, storage, unified namespace

Introduction To Alluxio (formerly Tachyon) and How It Brings Up To 300x Performance Improvement To Qunar’s Streaming Processing

May 19, 2017 by Yupeng Fu, Xueyan Li [Qunar]

Strata Data Conference London 2017 – Learn about stream processing on Alluxio from real-world workloads at Qunar, as well as how to position Alluxio in a streaming architecture.

Tags: architecture, big data, conference, data, data engineering, distributed systems, performance, storage, tiered storage, unified namespace
