This talk shares our design, implementation, and optimization of the Alluxio metadata service to address scalability challenges, focusing on how to apply and combine techniques including tiered metadata storage (based on the off-heap key-value store RocksDB), a fine-grained file system inode tree locking scheme, an embedded replicated state machine (based on Raft), and the exploration and performance tuning of RPC frameworks (Thrift vs. gRPC).
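The fine-grained inode tree locking scheme mentioned above can be sketched with a toy lock manager — a hypothetical Python illustration of the general idea (lock each inode along the path root-to-leaf so operations on disjoint subtrees do not contend on one global namespace lock), not Alluxio's actual Java implementation:

```python
import threading
from collections import defaultdict

class InodeLockManager:
    """Toy sketch of fine-grained path locking (hypothetical, not Alluxio's code).

    To operate on /a/b/c we acquire one lock per inode along the path,
    always root-to-leaf, so concurrent operations on disjoint subtrees
    (e.g. /a/b/c vs. /x/y) never block each other on a global lock, and
    the fixed acquisition order prevents deadlock.
    """

    def __init__(self):
        self._locks = defaultdict(threading.Lock)  # one lock per inode path
        self._registry_guard = threading.Lock()    # protects the lock table itself

    def _lock_for(self, path):
        with self._registry_guard:
            return self._locks[path]

    def lock_path(self, path):
        # Expand "/a/b/c" into its prefixes: "/", "/a", "/a/b", "/a/b/c"
        parts = [p for p in path.split("/") if p]
        prefixes = ["/"]
        for part in parts:
            prefixes.append(prefixes[-1].rstrip("/") + "/" + part)
        acquired = []
        for prefix in prefixes:  # root-to-leaf ordering
            lock = self._lock_for(prefix)
            lock.acquire()
            acquired.append(lock)
        return acquired

    def unlock(self, acquired):
        for lock in reversed(acquired):  # release leaf-to-root
            lock.release()
```

A real implementation would use read-write locks (read locks on ancestors, a write lock only on the target inode) so that reads under the same parent can proceed concurrently; plain mutexes are used here only to keep the sketch stdlib-only.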
Traditionally, if you want to run a single Spark job on EMR, you might follow these steps: launch a cluster, run the job, which reads data from a storage layer like S3, perform transformations within an RDD/Dataframe/Dataset, and finally send the result back to S3. You end up with something like this.
If you add more Spark jobs across multiple clusters, you could end up with something like this.
Problem If you have hundreds of external tables defined in Hive, what is the easiest way to change those references to point to new locations? That is a fairly common challenge for those who want to integrate Alluxio into their stack. A typical setup that we will see is that users will have Spark-SQL or …
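One way to script such a change is to generate Hive `ALTER TABLE … SET LOCATION` statements that rewrite each table's `s3://` URI to the corresponding `alluxio://` URI. The sketch below is hypothetical: the master address `alluxio://master:19998`, the table names, and the assumption that the bucket is mounted at the Alluxio namespace root are all placeholders.

```python
def rewrite_location(s3_uri, alluxio_root="alluxio://master:19998"):
    """Map an s3://, s3a://, or s3n:// table location onto an Alluxio path.

    Assumes the S3 bucket is mounted at the root of the Alluxio namespace;
    the master hostname and port here are placeholders.
    """
    for scheme in ("s3a://", "s3n://", "s3://"):
        if s3_uri.startswith(scheme):
            bucket_and_key = s3_uri[len(scheme):]
            # Drop the bucket name: it is implied by the Alluxio mount point.
            _, _, key = bucket_and_key.partition("/")
            return f"{alluxio_root}/{key}"
    raise ValueError(f"not an S3 URI: {s3_uri}")

def alter_statements(tables):
    """tables: iterable of (table_name, current_s3_location) pairs."""
    return [
        f"ALTER TABLE {name} SET LOCATION '{rewrite_location(loc)}';"
        for name, loc in tables
    ]

# Example: emit the DDL for one (hypothetical) table.
for stmt in alter_statements([("sales", "s3a://my-bucket/warehouse/sales")]):
    print(stmt)
```

Note that for partitioned tables each partition carries its own location, so the same rewrite would also need to be applied per partition.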
How do we access AWS S3 data when running Presto in an on-premises environment, and how can we do it efficiently to reduce both egress cost and query runtimes? Alluxio as a local cache for Presto queries against remote AWS S3 data sources As we move toward more and more decoupled environments, one of the things …
Register for this tech talk to learn how to run EMR Spark on Alluxio as a distributed file system cache for S3.
Increasingly, S3 is being used as a data store for analytical and machine learning workloads. This makes it very easy to generate a massive number of GET operations requesting data from S3. For example, a couple of commands can launch a 1,000-node cluster on the AWS EMR service with Spark or …
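The effect of putting a cache in front of S3 can be sketched with a toy read-through cache — a hypothetical illustration, where `fake_s3` is a stand-in counter rather than a real S3 client: once the first read brings an object in, every repeated read is served locally instead of issuing another GET.

```python
class ReadThroughCache:
    """Toy read-through cache: repeated reads of a hot object are served
    locally instead of issuing another GET against the backing store
    (simulated here by a plain function, not a real S3 client)."""

    def __init__(self, backend_get):
        self._backend_get = backend_get
        self._cache = {}
        self.backend_gets = 0  # how many requests actually reach the backend

    def get(self, key):
        if key not in self._cache:
            self.backend_gets += 1  # cache miss: one real GET
            self._cache[key] = self._backend_get(key)
        return self._cache[key]

# Simulate 1,000 workers all reading the same (hypothetical) training file.
fake_s3 = lambda key: f"bytes-of-{key}"
cache = ReadThroughCache(fake_s3)
reads = [cache.get("train/part-00000.parquet") for _ in range(1000)]
print(cache.backend_gets)  # without the cache: 1,000 GETs; with it: 1
```

This is the basic idea behind deploying a caching layer such as Alluxio between compute and S3: request volume against S3 scales with the working set, not with the number of readers.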
Introducing S3 and Spark S3 has become the de facto standard API for digital business applications to store unstructured data chunks. To this end, several vendors have S3-API-compatible offerings that allow app developers to standardize on the S3 APIs on-premises and port these apps to run on other platforms when ready. So, what is S3, and …
TensorFlow is an open source machine learning platform used to build applications like deep neural networks. It consists of an ecosystem of tools, libraries, and community resources for machine learning, artificial intelligence, and data science applications. S3 is an object storage service originally created by Amazon. It has a rich set of APIs …
Problem Sometimes big data analytics need to process input data from two different storage systems at the same time. For instance, a data scientist may need to join two tables, one from an HDFS cluster and one from S3. Existing Solutions Certain computation frameworks may be able to connect to storage systems including HDFS and popular cloud …
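The join itself is storage-agnostic once both inputs are readable; the hard part is reaching two systems from one job. As a minimal sketch (hypothetical in-memory rows standing in for tables that would, in practice, be read from an `hdfs://` path and an `s3://` path), a hash join of the two tables looks like:

```python
def hash_join(left, right, key):
    """Minimal hash join of two row lists (dicts) on a shared key column.

    In the scenario above, `left` might be loaded from HDFS and `right`
    from S3; once both are accessible (e.g. through one unified namespace),
    the join logic does not care where the rows came from.
    """
    index = {}
    for row in right:                      # build phase: index the smaller side
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:                       # probe phase
        for match in index.get(row[key], []):
            merged = {**row, **{k: v for k, v in match.items() if k != key}}
            joined.append(merged)
    return joined

orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 5}]  # e.g. from HDFS
users = [{"user_id": 1, "name": "ann"}]                               # e.g. from S3
print(hash_join(orders, users, "user_id"))
# → [{'user_id': 1, 'amount': 30, 'name': 'ann'}]
```

A unified namespace (the approach the post goes on to describe) lets a single engine address both sources by path, so the framework can run exactly this kind of join without per-job storage plumbing.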