big data Archives

Zookeeper vs Raft: Stateful Distributed Coordination with HA and Fault Tolerance

October 21, 2022

Big Data Bellevue & Cloudy With a Chance of Data Meetup, October 20, 2022. Distributed systems are made up of many components, such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, …

Tags: big data, distributed systems, fault tolerance, high availability, meetup, raft, zookeeper

ML-Based SQL Query Resource Usage Prediction

September 15, 2022

In the Big Data era, calculating the resource usage of a SQL query is usually computationally expensive. Can we estimate the resource usage of SQL queries more efficiently, without any computation in a SQL engine kernel? In this session, Chunxu and Beinan introduce how Twitter’s data platform leverages a machine learning-based approach in Presto and BigQuery to estimate query utilization with 90%+ accuracy.

Tags: alluxio day, big data, machine learning, presto, sql, twitter

A Year with Alluxio Community 2021

January 20, 2022 By Bin Fan and Jasmine Wang

2021 marked accelerated growth for the Alluxio Open Source Project. We could not be more grateful for what the community has achieved together in this past year. This blog post gives a brief summary of our community growth over the year.

Best Practice in Accelerating Data Applications with Spark+Alluxio

October 12, 2021

This talk shares the designs and use cases of the Alluxio and Spark integrated solutions, as well as best practices and “what not to do” when designing and implementing Alluxio distributed systems.

Tags: alluxio day, big data, data orchestration, distributed systems, spark

The practice of Presto & Alluxio in E-commerce big data platform

December 13, 2020

JD.com is one of the largest e-commerce corporations. JD.com's big data platform has tens of thousands of nodes and tens of petabytes of offline data, which require millions of Spark and MapReduce jobs to process every day. Presto is the main query engine: thousands of machines work as Presto nodes, and Presto plays an important role in in-place analysis and BI tools. Meanwhile, Alluxio is deployed to improve the performance of Presto. The practice of Presto & Alluxio at JD.com benefits many engineers and analysts.

Tags: big data, data orchestration, data orchestration summit, presto

Accelerate and Scale Big Data Analytics with Alluxio and Intel® Optane™ Persistent Memory

May 8, 2020

International Data Corporation (IDC) reported that the global datasphere will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. The variety and velocity of data growth make this trend increasingly complicated, and it continuously changes the ways data is collected, stored, processed, and analyzed. New analytics solutions, including machine learning, deep learning, and artificial intelligence (AI), along with new architectures and tools, are being developed to extract and deliver value from this huge datasphere.

Tags: analytics, big data, hybrid cloud, intel, open source, performance, persistent memory

Alluxio Accelerates Deep Learning in Hybrid Cloud using Intel’s Analytics Zoo open source platform powered by oneAPI

April 28, 2020

This article describes how Alluxio accelerates the training of deep learning models in a hybrid cloud environment with Intel’s Analytics Zoo open source platform, powered by oneAPI. It covers the new architecture and workflow, as well as Alluxio’s performance benefits and benchmark results.

Tags: analytics, analytics zoo, benchmark, big data, cloud, deep learning applications, hybrid cloud, intel, spark
