In this talk, we focus on how Tachyon helps improve the efficiency of big data analytics (ad-hoc queries) within Baidu.
Baidu currently runs a production Tachyon cluster with 100 nodes and over 2 PB of storage space; this cluster mainly serves as the cache layer for our big data analytics engine. First, we introduce the big data analytics infrastructure within Baidu. Then, we explain why we started using Tachyon a few months ago and the problems we encountered when we did. Next, we delve into the details of how Tachyon helps accelerate our big data analytics pipeline in its current state. Finally, we discuss the new features we would like to see and our plan to scale further.
Next-generation big data engines (Apache Spark, Tez, etc.) are known for the performance boost they get from in-memory computing. However, current memory sizes are far from enough to hold an entire data set. NVM (non-volatile memory) emerged in response to this need, but integrating it into a modern big data system is a challenge: for example, it can mean dealing with GC overhead and refactoring system APIs. NVM does bring benefits for in-memory and real-time computation, but it also raises new questions about memory management in big data.
In this talk, we present our efforts to build tiered storage in Tachyon, which provides a software solution for next-generation data center platforms with NVM. It is transparent to the end user while delivering better performance for real-world applications.