This is a guest blog by Chengzhi Zhao, originally published on his blog. Traditionally, if you want to run a single Spark job on EMR, you might follow these steps: launch a cluster, run the job, which reads data from a storage layer such as S3 and performs transformations within an RDD/DataFrame/Dataset, and finally send the result back to S3.
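For concreteness, here is a minimal Scala sketch of that single-job pattern using the Spark DataFrame API. The bucket names, paths, and filter condition are hypothetical placeholders, not details from the original post.

```scala
// Minimal sketch of the single-Spark-job-on-EMR pattern described above.
// Bucket names, paths, and the filter condition are hypothetical.
import org.apache.spark.sql.SparkSession

object SingleEmrJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-emr-job")
      .getOrCreate()

    // Read input data from the storage layer (S3).
    val events = spark.read.parquet("s3://example-bucket/input/events/")

    // Perform transformations on the DataFrame.
    val daily = events
      .filter("status = 'ok'")
      .groupBy("event_date")
      .count()

    // Send the result back to S3.
    daily.write.mode("overwrite").parquet("s3://example-bucket/output/daily-counts/")

    spark.stop()
  }
}
```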
What’s Spark+AI Summit? It’s the world’s largest conference focused on Apache Spark, the open source project that is Alluxio’s older cousin from the same lab (UC Berkeley’s AMPLab, now RISElab).
During the past several years, Spark has significantly changed the landscape of big data computing, dramatically improving the performance of a wide range of applications. However, in certain Spark use cases the bottleneck lies in the I/O stack. In this talk, we will introduce Tachyon, a distributed memory-centric storage system, and discuss several production use cases where Tachyon further improves Spark applications’ performance by orders of magnitude.
A few months ago, Baidu deployed Alluxio to accelerate its big data analytics workload. Bin Fan and Haojun Wang explain why Baidu chose Alluxio, as well as the details of how they achieved a 30x speedup with Alluxio in their production environment with hundreds of machines. Based on the success of the big data analytics engine, Baidu is currently expanding the Alluxio and Spark infrastructure to accelerate other applications, such as machine learning.
In this talk, Haoyuan Li, co-creator of Tachyon (and a founding committer of Spark) and CEO of Tachyon Nexus, will explain how the next wave of innovation in storage will be driven by separating the functional layer from the persistent storage layer, and how a memory-centric architecture, realized through Tachyon, is making this possible. Li will describe the future of distributed file storage and highlight how Tachyon supports specific use cases.
Throughout our four-year history, Scala By the Bay and Scale By the Bay have led the way in evangelizing and understanding modern software architectures. We have the best set of them here, including Akka, Kafka, Spark, Finagle, Lagom, and so on. How do they come together in a SMACK / MIND Stack? What are the best practices to follow and the pitfalls to avoid? This panel of experienced practitioners will discuss and illuminate it all.
Using intermediate APIs means developers can learn just one framework and still access features offered by different technologies. It means writing job logic only once and being able to test it on a new underlying service with no extra effort. Modularity is not only a win for users; it also means creators of execution frameworks and storage systems can focus on performance and capability without having to worry about API maintenance.
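As a hedged illustration of that point, the sketch below writes job logic once against Spark DataFrames and swaps the underlying storage service by changing only a URI. The cluster address, paths, and column name are assumptions for the example, not details from the talk.

```scala
// Hypothetical sketch: the same job logic runs against different storage
// systems purely by parameterizing the input/output URIs. The URI scheme
// (s3://, hdfs://, alluxio://) selects the backend, assuming the matching
// filesystem client is on the classpath.
import org.apache.spark.sql.{DataFrame, SparkSession}

object PortableJob {
  // The job logic is written once against DataFrames, not a concrete store.
  def transform(input: DataFrame): DataFrame =
    input.groupBy("user_id").count()

  def run(spark: SparkSession, inputUri: String, outputUri: String): Unit = {
    val result = transform(spark.read.parquet(inputUri))
    result.write.mode("overwrite").parquet(outputUri)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("portable-job").getOrCreate()
    // Swap the underlying service without touching the job logic:
    run(spark, "alluxio://master:19998/data/in", "alluxio://master:19998/data/out")
    // run(spark, "s3://example-bucket/in", "s3://example-bucket/out")
    spark.stop()
  }
}
```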
Using Alluxio to Improve Spark & Hadoop HDFS System Performance and Reliability [Chinese]
In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the-loop cyber threat analytics pipeline. He will discuss analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka, and Alluxio in creating an analytics architecture with mission-critical response times.
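To make that architecture more concrete, here is a hedged sketch of one leg of such a pipeline: consuming events from Kafka with Spark Structured Streaming and landing them in shared storage for downstream SQL queries. The broker, topic, and Alluxio paths are hypothetical, and the talk itself may well have used a different pipeline shape or the DStream-based Spark Streaming API.

```scala
// Hedged sketch of one leg of a threat-analytics pipeline: Kafka -> Spark ->
// shared storage. Requires the spark-sql-kafka connector on the classpath.
// Broker, topic, and path names are hypothetical.
import org.apache.spark.sql.SparkSession

object ThreatIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("threat-ingest").getOrCreate()

    // Consume raw events from Kafka.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "threat-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS event")

    // Land the stream in shared storage (e.g. an Alluxio-backed path)
    // where downstream SQL jobs and human analysts can query it.
    val query = events.writeStream
      .format("parquet")
      .option("path", "alluxio://master:19998/threats/raw")
      .option("checkpointLocation", "alluxio://master:19998/threats/_checkpoints")
      .start()

    query.awaitTermination()
  }
}
```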