Today, when we create a Hive table, it is common to partition the table by values and ranges to improve query performance and reduce maintenance cost. However, Hive cannot directly access a table with a single query when the data of that table is spread across different storage media and … Continued
Learn intermediate and advanced compute, storage, and cloud concepts.
Introducing S3 and Spark: S3 has become the de facto standard API for digital business applications to store unstructured data chunks. To this end, several vendors have S3-API-compatible offerings that allow app developers to standardize on the S3 APIs on premises, and port these apps to run on other platforms when ready. So, what is S3 and … Continued
TensorFlow is an open source machine learning platform used to build applications like deep neural networks. It consists of an ecosystem of tools, libraries, and community resources for machine learning, artificial intelligence, and data science applications. S3 is an object storage service that was originally created by Amazon. It has a rich set of APIs … Continued
Problem: It is increasingly popular among data scientists to train models with frameworks like TensorFlow on a local server or cluster while using remote shared storage such as S3 or Google Cloud Storage to hold massive amounts of input data. This stack provides high flexibility and cost efficiency, especially since it requires no dev-ops … Continued
Problem: Sometimes big data analytics needs to process input data from two different storage systems at the same time. For instance, a data scientist may need to join two tables, one from an HDFS cluster and one from S3. Existing Solutions: Certain computation frameworks may be able to connect to storage systems including HDFS and popular cloud … Continued
Increasingly, S3 is being used as a data store for analytical and machine learning workloads. This means that it is very easy to generate a massive number of GET operations and requests for data from S3. For example, a couple of commands can launch a 1000-node AWS EMR cluster with Spark or … Continued
Alluxio is a data orchestration system that provides data locality with intelligent multi-tiering. The replication parameters are easily configured and, once set, Alluxio handles replication transparently to the requesting compute framework. As always, no changes are required by the end user; it’s transparent. In the above diagram, data is stored in RAM, SSD, or HDD. … Continued
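To make the idea of multi-tiering across RAM, SSD, and HDD more concrete, here is a minimal toy sketch in Python. It is not Alluxio's API or its actual placement policy: the `TieredStore` class, its method names, the per-tier block limits, the least-recently-used eviction, and the promote-on-read behaviour are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store (hypothetical, not Alluxio's API)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _place(self, key, value, level):
        # Falling past the slowest tier drops the block from the store.
        if level >= len(self.tiers):
            return
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        if len(blocks) > cap:
            # Assumed policy: demote the least recently used block one tier down.
            old_key, old_value = blocks.popitem(last=False)
            self._place(old_key, old_value, level + 1)

    def put(self, key, value):
        # Remove any stale copy first, then write to the fastest tier.
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)
        self._place(key, value, 0)

    def get(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                value = blocks.pop(key)
                # Assumed policy: promote a block to the fastest tier on read.
                self._place(key, value, 0)
                return value
        return None

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


store = TieredStore([("MEM", 1), ("SSD", 1), ("HDD", 1)])
for block_id in ("a", "b", "c"):
    store.put(block_id, f"data-{block_id}")
print({k: store.tier_of(k) for k in ("a", "b", "c")})
```

The caller only ever uses `put` and `get`; which tier a block currently lives in is an internal detail, which is the sense in which tiering stays transparent to the requesting framework.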