This article walks through a simple example to illustrate how Structured Data Management, available in the latest Alluxio 2.1.0 release, helps SQL and structured data workloads.
Founding Engineer, Alluxio
This article introduces Structured Data Management (Developer Preview) available in the latest Alluxio 2.1.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio.
Notice anything new about our website? That’s right: we are super excited to launch our new website, Alluxio.io!
As we continue our focus on our open source community, one important item on our minds was rebuilding our website to provide a better user experience for that community. To that end, you’ll see lots of changes in the Alluxio web experience.
Impersonation is the ability for one user to act on behalf of another user. For example, say user ‘yarn’ has the credentials to connect to a service, but user ‘foo’ does not. On its own, user ‘foo’ would never be able to access the service. However, user ‘yarn’ can connect to the service and impersonate (act on behalf of) user ‘foo’, which gives user ‘foo’ access. In this way, impersonation enables one user to access a service on behalf of another user.
The impersonation feature defines how users can act on behalf of other users. Therefore, it is important to know who the users are.
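The ‘yarn’ and ‘foo’ scenario above can be sketched as a small access check. This is a minimal illustration only, not Alluxio's implementation: the `Service` class, its `credentialed` and `may_impersonate` fields, and the `resolve_user` method are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Service:
    """Toy service that models credentials and impersonation rules (hypothetical)."""

    # Users that hold credentials and can connect to the service directly.
    credentialed: Set[str] = field(default_factory=set)
    # Maps a connecting user to the set of users it is allowed to act as.
    may_impersonate: Dict[str, Set[str]] = field(default_factory=dict)

    def resolve_user(self, connection_user: str,
                     impersonated_user: Optional[str] = None) -> str:
        """Return the user the request runs as, or raise PermissionError."""
        # Step 1: the connecting user must be able to reach the service at all.
        if connection_user not in self.credentialed:
            raise PermissionError(f"{connection_user!r} cannot connect to the service")
        # Step 2: with no impersonation requested, act as the connecting user.
        if impersonated_user is None or impersonated_user == connection_user:
            return connection_user
        # Step 3: impersonation must be explicitly allowed for this pair of users.
        if impersonated_user not in self.may_impersonate.get(connection_user, set()):
            raise PermissionError(
                f"{connection_user!r} may not impersonate {impersonated_user!r}")
        return impersonated_user


# 'yarn' has credentials and may act on behalf of 'foo'; 'foo' has no credentials.
service = Service(credentialed={"yarn"}, may_impersonate={"yarn": {"foo"}})
effective = service.resolve_user("yarn", impersonated_user="foo")
```

The sketch keeps two identities apart: the connection user (who authenticated) and the effective user (who the request acts as). That separation is why it matters to know who the users are before configuring impersonation.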
The Apache Spark + Alluxio stack is becoming popular, particularly for unifying data access across S3 and HDFS. In addition, compute and storage are increasingly being separated, which causes larger latencies for queries. Alluxio is used as compute-side virtual storage to improve performance. But as with any technology stack, getting the best performance requires following best practices. This article provides the top 10 tips for performance tuning for real-world workloads when running Spark on Alluxio, with data locality giving the most bang for the buck.
We held our first New York City Alluxio Meetup! Work-Bench generously hosted the Alluxio meetup in Manhattan. This was the first US Alluxio meetup outside of the Bay Area, so it was extremely exciting to meet Alluxio enthusiasts on the East Coast!
The meetup focused on users of Alluxio with different applications from Hive and Presto. As an introduction, Haoyuan Li (creator and founder of Alluxio) and Bin Fan (founding engineer of Alluxio) gave an overview of Alluxio and the new features and enhancements of the new v1.8.0 release.
Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their case study, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or would never finish because they would run out of memory. After using Alluxio, those jobs were able to finish, because the data could be stored in Alluxio, instead of within Spark.
In this blog, we show that by saving RDDs in Alluxio, Spark applications can keep larger data sets in memory and run faster, and that RDDs can be shared across separate Spark applications.
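As a rough sketch of what saving and sharing an RDD through Alluxio can look like from PySpark: the helper names (`alluxio_uri`, `save_rdd`, `load_rdd`), the master address `localhost:19998`, and the example path are assumptions for illustration. The helpers only wrap the standard `RDD.saveAsTextFile` and `SparkContext.textFile` calls, and they assume the Alluxio client is already on Spark's classpath.

```python
# Assumed Alluxio master address; adjust the host and port for a real cluster.
ALLUXIO_MASTER = "alluxio://localhost:19998"


def alluxio_uri(path, master=ALLUXIO_MASTER):
    """Build a full alluxio:// URI from a path inside the Alluxio namespace."""
    return master.rstrip("/") + "/" + path.lstrip("/")


def save_rdd(rdd, path):
    """Write an RDD to Alluxio as text files and return the URI that was used.

    The target directory must not exist yet, or Spark will refuse to write.
    """
    uri = alluxio_uri(path)
    rdd.saveAsTextFile(uri)
    return uri


def load_rdd(sc, path):
    """Read the saved data back as an RDD of strings.

    `sc` can be the SparkContext of a different application than the writer,
    which is how two jobs can share the same data.
    """
    return sc.textFile(alluxio_uri(path))


# Typical use inside Spark jobs (requires a running Spark and Alluxio setup):
#   save_rdd(sc.parallelize(range(100)), "/demo/numbers")   # application A
#   numbers = load_rdd(sc, "/demo/numbers").map(int)        # application B
```

Because `saveAsTextFile` stores text, the reader gets strings back and has to convert them, as in the `map(int)` call in the usage comment.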
Alluxio helps organizations handle their big data by providing a unified view of all of the data in the enterprise, whether on premises, in the cloud, or hybrid. Applications access data using a standard interface to a global virtual namespace. Alluxio also employs a memory-centric architecture to enable data access at memory speed. With the combined unification and performance benefits, Alluxio can effectively provide big data federation for organizations by acting as a virtual data lake.
Using Alluxio, data can be shared between pipeline stages at memory speed. When one stage reads and writes its data in Alluxio, that data can stay in memory for the next stage of the pipeline, which can greatly improve performance. Alluxio Enterprise Edition (AEE) introduces Fast Durable Writes, a feature that enables low-latency, fault-tolerant writes. In this article, we describe the Fast Durable Writes feature and explore how Alluxio can be deployed and used with a data pipeline.