On-Demand Videos
In this talk, Ojus Save walks you through a demo of how to build AI applications on Zoom. This demo shows you an AI agent that receives transcript data from RTMS and then decides if it has to create action items based on the transcripts that are received.
In this talk, Sandeep Joshi, Senior Manager at NVIDIA, shares how to accelerate data access between GPU and storage for AI. Sandeep will dive into two options: CPU-initiated GPUDirect Storage and GPU-initiated SCADA.
Bin Fan, VP of Technology at Alluxio, introduces how Alluxio, a software layer that sits transparently between applications and S3 (or other object stores), provides sub-ms time to first byte (TTFB) with up to 45x lower latency.
This Data Orchestration Summit session discusses the challenges of querying diverse data sources at Walmart and how they are tackled using Presto and Alluxio.
It covers how Alluxio caching was leveraged to provide consistent, optimized query performance within and across clouds.
It also highlights the implementation of critical components of the enterprise acceleration offering, such as security integration for fine-grained access control, auto-scaling, and auto-deployment in GCP.
In this panel, creators of open source projects share their stories from why they started the project to the challenges they encountered on the way.
Spark is a widely adopted open source framework that provides a unified interface for analytics and machine learning workloads. Alluxio, originating from the UC Berkeley AMPLab – the same lab as Spark, is an open source data orchestration platform that empowers compute frameworks like Spark by providing stateful caching to enable efficient data sharing between multiple jobs and improving resilience against job failures as well as bringing data together from many different sources, be it remote HDFS or cloud object stores.
Alluxio partnered with IBM to deliver a Spark-based solution for fast data analytics. With the integration of IBM Spectrum Conductor, an advanced workload and resource management platform that maximizes hardware utilization to speed results and cut infrastructure costs, Alluxio and IBM delivered a solution that powers a leading telecom company’s applications to support 320 million subscribers. In this online meetup, we will present the benefits of the fast analytics stack of Spark on Alluxio and IBM and dive into a leading telecom company’s use case of leveraging Spark and Alluxio to process massive amounts of mobile data.
In this online meetup, you will learn about:
- Why leading companies are moving towards a decoupled compute and storage architecture, and the associated challenges and requirements
- Why Spark and Alluxio together can solve the challenges and fulfill the requirements
- How a leading telecom company leverages Spark with Alluxio for fast data processing at scale on top of object store and HDFS
Using “zero-copy” hybrid bursting with Spark to solve capacity problems
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely run very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges: network latency impacts performance, copying data via DistCP means maintaining duplicate data, and you may have to make application changes to accommodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
In this tech talk, we’ll discuss:
- Approaches to burst data to the cloud
- How Alluxio can enable “zero-copy” bursting of Spark workloads to cloud data services like EMR and Dataproc
- How DBS Bank uses Alluxio to solve for limited on-prem compute capacity by zero-copy bursting Spark workloads to AWS EMR
The data ecosystem has heavily evolved over the past two decades. There’s been an explosion of data-driven frameworks, such as Presto, Hive, and Spark to run analytics and ETL queries and TensorFlow and PyTorch to train and serve models. On the data side, the approach to managing and storing data has evolved from HDFS to cheaper, more scalable and separated services typified by cloud stores like AWS S3. As a result, data engineering has become increasingly complex, inefficient, and hard, particularly in hybrid and cloud environments.
Haoyuan Li offers an overview of a data orchestration layer that provides a unified data access and caching layer for single cloud, hybrid, and multicloud deployments. It enables distributed compute engines like Presto, TensorFlow, and PyTorch to transparently access data from various storage systems (including S3, HDFS, and Azure) while actively leveraging an in-memory cache to accelerate data access.
Many organizations are leveraging Hive to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this Office Hour we’ll go over:
- Bazaarvoice’s use case leveraging Apache Spark, Hive, and Alluxio on S3
- How to set up Hive with Alluxio such that Hive jobs can seamlessly read from and write to S3
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more
ING Bank is a multinational financial services company headquartered in Amsterdam with over $1 trillion in assets. As a leading bank, we place a great emphasis on cybersecurity. One aspect of this is security information and event management (SIEM), the process of identifying, monitoring, recording, and analyzing security events or incidents within a real-time IT environment. SIEM requires our data platform to have high and consistent performance, so we use the open source technologies Presto and Alluxio for fast SQL analytics in the cloud.
In this online presentation, we are going to present how ING is leveraging Presto (interactive query), Alluxio (data orchestration & acceleration), S3 (massive storage), and DC/OS (container orchestration) to build and operate our modern Security Analytics & Machine Learning platform. We will share the challenges we encountered and how we solved them. Today we run this platform in several different data centers, and we have reduced our 10+ minute queries to under 10 seconds!
EMR has become a widely used service for running big data analytics in the public cloud. But slow or inconsistent EMR performance on S3 data lakes creates challenges for organizations.
Alluxio is a data orchestration layer for the cloud that increases performance of analytic workloads running on AWS EMR using S3 as the storage.
Join us for this webinar where we will show you how to set up EMR Spark and Hive with Alluxio so jobs can seamlessly read from and write to your S3 data lake. You’ll see the performance gains with Alluxio in your EMR/S3 stack.
Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this Office Hour we go over:
- How to set up EMR Spark with Alluxio such that Spark jobs can seamlessly read from and write to S3
- A performance comparison between Spark on S3 and Spark with Alluxio on S3
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more
Kubernetes is widely used across enterprises to orchestrate computation. And while Kubernetes helps improve flexibility and portability for computation in public/hybrid cloud environments across infrastructure providers, running data-intensive workloads can be challenging.
Efficiently moving data closer to frameworks such as Spark or Presto, co-locating data with them, and accessing data from multiple or remote clouds are all hard to do. That’s where Alluxio, an open source data orchestration platform, can help.
Alluxio enables data locality with your Spark and Presto workloads for faster performance and better data accessibility in Kubernetes. It also provides portability across storage providers.
In this on demand tech talk we’ll give a quick overview of Alluxio and the use cases it powers for Spark/Presto in Kubernetes. We’ll show you how to set up Alluxio and Spark/Presto to run in Kubernetes as well.
Alluxio 2.0 is the most ambitious platform upgrade since the inception of Alluxio with greatly expanded capabilities to empower users to run analytics and AI workloads on private, public or hybrid cloud infrastructures leveraging valuable data wherever it might be stored.
This release, now available for download, includes many advancements that will allow users to push the limits of their data workloads in the cloud.
In this tech talk, we will introduce the key new features and enhancements such as:
- Support for hyper-scale data workloads with tiered metadata storage, distributed cluster services, and adaptive replication for increased data locality
- Machine learning and deep learning workloads on any storage with the improved POSIX API
- Better storage abstraction with support for HDFS clusters across different versions & active sync with Hadoop
Alluxio 2.0 Preview Release Deep Dive
We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand use cases supported to include active workloads as well as better metadata management and availability to support hyperscale deployments. Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.
In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:
– New off-heap metadata storage leveraging embedded RocksDB to scale up Alluxio to handle a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on the Raft consensus algorithm in high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross-mount move/copy, and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
– Active file system sync between Alluxio and HDFS as under storage.