The big data stack started with just two APIs, one for processing (Hadoop/MapReduce) and one for storage (HDFS/GFS); now there are many more. It seems every year brings a new project to get excited about, but taking advantage of it also means new APIs, rewritten pipelines, and complex storage logic. This cost is driving users toward intermediate APIs that separate the interface from the execution or implementation. In this talk we discuss what these abstractions should look like, how they will impact the industry, and two projects that embody them: Beam and Alluxio.
Beam, a job description layer, sits atop popular execution frameworks, including Spark and Flink. A pipeline written in Beam is portable across these execution frameworks. Beam unifies batch and stream processing, on-premises and cloud deployment, and big and small data.
Alluxio, a distributed, memory-centric virtual file system, accepts data from popular execution frameworks as well as from popular storage systems and file systems. It offers a universal namespace and tiered logic at memory speeds.
Using intermediate APIs means developers can learn just one framework and still access features offered by different technologies. It means writing job logic only once and being able to test it on a new underlying service with no extra effort. Modularity is not only a win for users; it also means creators of execution frameworks and storage systems can focus on performance and capability without having to worry about API maintenance.
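The separation of job description from execution described above can be illustrated with a minimal Python sketch. It is a toy, not Beam's API: `Pipeline`, `EagerRunner` and `LazyRunner` are hypothetical names invented here, and the only point is that one job description runs unchanged on two different execution strategies.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Pipeline:
    """Hypothetical job description: an ordered list of steps, no execution logic."""
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, fn: Callable) -> "Pipeline":
        self.steps.append(("map", fn))
        return self

    def filter(self, fn: Callable) -> "Pipeline":
        self.steps.append(("filter", fn))
        return self


class EagerRunner:
    """Hypothetical runner that materializes a full list after every step."""

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        items = list(data)
        for kind, fn in pipeline.steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


class LazyRunner:
    """Hypothetical runner that chains iterators so elements flow one at a time."""

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        stream = iter(data)
        for kind, fn in pipeline.steps:
            # The built-in map/filter bind fn immediately, so each step keeps its own function.
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


# The job logic is written once and handed to either runner.
job = Pipeline().map(lambda x: x * 2).filter(lambda x: x > 4)
for runner in (EagerRunner(), LazyRunner()):
    print(type(runner).__name__, runner.run(job, [1, 2, 3, 4]))
```

Both runners return the same result for the same job; only the execution strategy differs, which is the property an intermediate API is meant to guarantee.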
Eric is a Product Manager at Google on Cloud Dataflow. He works closely with Beam committers and is a minor contributor. Previously he was at Amazon Web Services on EC2. He is also on the project management committee for Alluxio and a minor contributor. He studied engineering at the University of Utah and business at Harvard.
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio
Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called “Data Lake”, where raw data is stored and periodically ingested into a distributed file system such as HDFS.
Moreover, in a number of use cases you might want to avoid storing data on the development cluster disks, for example to comply with regulations or to reduce latency. In those cases Alluxio (previously known as Tachyon) can make the data available in memory and share it among multiple applications.
We propose an agile workflow that combines Spark, Scala, the DataFrame API (and the more recent Dataset API), JDBC, Parquet, Kryo and Alluxio into a scalable, in-memory, reactive stack for exploring data directly from the source and developing high-quality machine learning pipelines that can then be deployed straight into production.
In this talk we will:
- Present how to load raw data from an RDBMS and use Spark to make it available as a Dataset
- Explain the iterative exploratory process and advantages of adopting functional programming
- Offer a critical analysis of the issues faced with the existing methodology
- Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds
- Discuss some future improvements to the overall architecture
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications.
His main expertise is on building production-oriented machine learning systems.
He is co-author of the Professional Manifesto for Data Science (datasciencemanifesto.com), founder of the Data Science Milan Meetup group and a former speaker at Spark Summit Europe 2015.
He loves evangelising his passion for best practices and effective methodologies amongst the community.
Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
On-demand compute clusters are often used to save the cost of running and maintaining a continuous cluster for the sake of ad-hoc analysis. Such clusters also provide significant cost savings in storage, since data can be stored in a much cheaper medium, such as object storage. However, one critical downside which prevents on-demand compute clusters from becoming the norm for sporadic data analytics is the lack of high performance. Without co-locating compute and storage, queries and analysis may take unacceptably long periods of time, greatly reducing the value of gathering such insights.
To address this limitation, Alluxio is used as a lightweight data access layer on the compute nodes to bring performance up to memory speeds without requiring a long-running cluster. This talk will summarize why Alluxio’s architecture makes it a perfect fit for completing the on-demand cluster puzzle.
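The idea of a memory-speed access layer in front of cheap remote storage can be sketched as a read-through cache. The Python below is only an illustration under stated assumptions: `SlowStore` and `MemoryTier` are hypothetical names, the eviction policy (least recently used) is chosen for simplicity, and none of it reflects Alluxio's actual API or implementation.

```python
from collections import OrderedDict


class SlowStore:
    """Hypothetical stand-in for remote object storage; counts how often it is read."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class MemoryTier:
    """Hypothetical read-through LRU cache living on the compute node."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            # Cache hit: mark as most recently used and skip the remote read.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Cache miss: fetch from the backing store and keep a copy in memory.
        value = self.backing.get(key)
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return value


store = SlowStore({"a": b"1", "b": b"2", "c": b"3"})
tier = MemoryTier(store, capacity=2)
tier.get("a")
tier.get("a")  # second read is served from memory
print(store.reads)
```

Repeated reads of the same object reach the remote store only once, which is why a short-lived cluster can still serve hot data at memory speed after the first access.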
Calvin Jia is the top contributor to the Alluxio project and one of the earliest contributors. He started on the project as an undergraduate working in UC Berkeley’s AMPLab. He is currently a software engineer at Alluxio. Calvin has a BS from the University of California, Berkeley.