Following part 1 and part 2, this final blog of the series discusses some design decisions and details, as well as certain future work.
Discussions and Future Work
Why not exactly once delivery for pub/sub?
As we know, exactly once message delivery for pub/sub would greatly simplify our design and there do exist many powerful systems such as Kafka and RabbitMQ that are created precisely to solve this problem. An advantage of using these systems would be that faults may have less of an impact on performance. For example if a connection was lost at a subscriber, on reconnection the system could just continue where it left off.
Unfortunately maintaining these systems themselves can be a very complex task, which requires asking questions like, how many nodes physical machines do you allocate, how many times do you replicate a message, how long do you keep a message, do you block operations when you cannot publish messages due to connection issues, etc. And in the end a fault recovery mechanism will likely still be needed, resulting in an even more complex design.
(Note that to ensure eventual consistency, we actually only need at least once delivery of messages, as delivering a message multiple times would only have a negative impact on performance and not consistency, but still the majority of difficulties remain in this case.)
Scaling beyond 20 Alluxio clusters or dealing with frequent faults
For future work we may want to support scaling to hundreds of Alluxio clusters, but scaling from 20 to hundreds of clusters may require different design considerations, first because we may expect failures to be more frequent, and second because the design may result in significant overhead on the masters.
As described previously, the frequency of faults will reduce the performance to be similar to that of time based synchronization. With hundreds of clusters we may expect a network or master node failure to happen fairly frequently. (Note that this also depends on the configuration, as failures will only affect the clusters that mount an intersecting UFS path related to the failure. Thus, if clusters mostly mount disjoint UFS paths, then this may be less of an issue.) Furthermore, if all clusters mount intersecting paths then they will have to maintain subscriptions to all other clusters, and a publication would have to send hundreds of messages.
In this case we may want to integrate a reliable pub/sub mechanism such as Kafka or RabbitMQ, but instead of changing the overall system design these would simply replace the point to point subscriptions. Failures would still be expected and clusters would recover in the same way by marking the intersecting UFS paths as needing synchronization. Only the reliable pub/sub mechanism would hide many of the failures from Alluxio. For example if the mechanism was to reliably store the last 5 minutes worth of messages, only a failure lasting longer than that would need to be recovered from using the original method. Furthermore, these systems are able to scale independently of the number of Alluxio clusters by adding more nodes to them as necessary. Still, using and maintaining these systems creates a large amount of overhead which may only be worthwhile in certain configurations.
Some observations about consistency
While the previous discussions have shown the basic ideas for ensuring eventual consistency, there are a few important details that were not detailed.
First, the invalidation messages must be published after the modification to the UFS has been completed and, second, the UFS must ensure strong consistency at the level of linearizability or external consistency (consistency in S3). If either of these conditions are not satisfied then when the invalidation is received at a subscriber and it performs a synchronization, the cluster may not observe the most up to date version of the file. Third, if a cluster loses connection to the cross cluster master, and then reestablishes it later, the cluster must also go through the failure recovery process, as there may have been an external cluster that mounted and modified the path during the connection interruption.
Publishing the full metadata
As described, the invalidation messages published only contain the path that has been modified. Alternatively these messages could include the updated metadata for the path, thus avoiding a synchronization on the subscribing cluster. The reason this is not done is that there is no general way to know which version of the metadata is the most up to date version.
Consider for example two Alluxio clusters C1 and C2 that update the same file on the UFS. At the UFS the update from cluster C1 happens before the update from cluster C2. Both clusters then publish their updated metadata to a third cluster C3. Consider that due to network conditions the message from C2 arrives before C1. C3 now must know that it should discard the update from C1 as it already has the most up to date version of the metadata. Of course this can be done for example if the metadata contains versioning information, but unfortunately we do not have a general way to do this for all UFSs supported by Alluxio. As a result, C3 will still need to perform a metadata synchronization with the UFS to get the most up to date version directly from the source of truth.
Subscribing to notification services
Certain UFSs expose notification services that allow the user to know when a file has been modified, for example Amazon SNS and HDFS iNotify. For such UFSs, it may be preferable for the Alluxio Clusters to subscribe to these services instead of Alluxio clusters. This would give the advantage of supporting writes to the UFS that do not go through Alluxio. Again the system design would remain the same, just that instead of subscribing to other Alluxio clusters, a subscription would be made to one of these services.
Note that Alluxio also provides the ActiveSync feature for HDFS allowing the metadata to remain in sync with the underlying UFS. This differs from the Cross Cluster synchronization mechanism as ActiveSync performs synchronizations as files are updated, while Cross Cluster sync only performs synchronizations when files are accessed.
This blog series has described cases where it may be beneficial to run multiple Alluxio clusters, as well as the processes used by Alluxio to keep clusters in sync with the mounted UFSs, both using time based synchronization and the cross cluster synchronization feature. More details on how to deploy cross cluster synchronization can be found here.