Alluxio Block Allocation Policy Explained

Xi Chen, one of the Top 100 Alluxio open source project contributors, explains the block allocation policy of Alluxio at the code level.

Overview

Alluxio workers are responsible for managing local resources, and they store data as blocks. Users can allocate different storage tiers as the resources for Alluxio workers, including MEM/SSD/HDD, which are further composed of directories. 

How does an Alluxio worker decide which directory to put a block in when a user reads or writes data through Alluxio? In this article, we analyze the policies of block allocation from the source code.

Introduction to Alluxio’s Block Allocation PolicY

Alluxio uses block allocation policies to define how to allocate new blocks across multiple storage directories (in the same tier or across different tiers). The allocation policy defines which storage directory to allocate the new block in.

Alluxio supports having multiple storage tiers(typically each tier stands for a storage medium like MEM or SSD), and in each tier, you can define multiple directories (for example, multiple SSD volumes as the SSD tier).

Currently, there are three types of block allocation policies in Alluxio.

  1. alluxio.worker.block.allocator.GreedyAllocator: It loops from the top tier to the lowest tier, trying to put the new block into the first directory that can contain the block. This policy is only for demonstration purposes. In production, you should choose a more complicated policy for better allocation decisions.
  2. alluxio.worker.block.allocator.MaxFreeAllocator: Start trying from tier 0 to the lowest tier. In each tier, try to allocate the block to the storage directory that currently has the most availability. If no directory in the current tier can serve the new block, go to a lower tier.
  3. alluxio.worker.block.allocator.RoundRobinAllocator: Start trying from tier 0 to the lowest tier. On each tier,  try to allocate the new block into a directory following the Round Robin order. If no directory in the current tier can serve the new block, go to a lower tier.

The default policy is MaxFreeAllocator, which can be configured by worker property alluxio.worker.allocator.class.

Source Code Explanation

allocateSpace

allocateSpace is responsible for allocating a block to a storage directory.

When reading or writing data, the Alluxio worker will request a storage tier for the block to store the data according to the block allocation policies. allocateSpace will first try to find the location (starting from tier 0) in the location specified in the incoming options. If there is not enough space found in the specified location, it will try to find it in lower storage tiers.

private StorageDirView allocateSpace(long sessionId, AllocateOptions options)
      throws WorkerOutOfSpaceException, IOException {
      while (true) {
        if (options.isForceLocation()) {
            //…
        } else {
            //…
            dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),
                options.getLocation(), allocatorView, false);
            if (dirView != null) {
              return dirView;
            }
            dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),
                BlockStoreLocation.anyTier(), allocatorView, false);
            if (dirView != null) {
              return dirView;
            }
      }
        //…
}

As you can see from the above source code, the storage directories selected by allocateSpace are dynamically calculated. However, at some point, data needs to be available to write to the specified storage directory or tier, so allocateSpace also supports evict blocks to free up enough space in a storage directory to accommodate new data. We’ll talk more about it in the later section.

private StorageDirView allocateSpace(long sessionId, AllocateOptions options) {
    StorageDirView dirView;
//…
    while (true) {
      if (options.isForceLocation()) {
        dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),
            options.getLocation(), allocatorView, true);
        if (dirView != null) {
          return dirView;
        }
        freeSpace(sessionId, options.getSize(), options.getSize(), options.getLocation());
        dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),
                      options.getLocation(), allocatorView.refreshView(), true);
                //…
        }
      }
      //…
      } else {
        //…
      }
      return dirView;
    }
  }

Allocator is the interface for the allocation policy of Alluxio managed data. 

GreedyAllocator, MaxFreeAllocator, and RoundRobinAllocator all implement the Allocator interface to define the policy.

Allocator

public interface Allocator {
 
​  //…
  StorageDirView allocateBlockWithView(long sessionId, long blockSize, BlockStoreLocation location,
        BlockMetadataView view, boolean skipReview);
}

allocateBlockWithView allocates a block from the given block store location under a given view. There are three main parameters.

  • blockSize: the size of block in bytes.
  • location: the desired location in block store, tier 0, or any tier.
  • skipReview: for an allocation decision, there’s a review step to reject bad decisions. This flag specifies whether the review should be skipped, which means the decision is final.

Let’s take the default alluxio.worker.block.allocator.MaxFreeAllocator as an example. It finds storage directories in the current tier, then it checks all the storage directories one by one and returns the largest one. 

private StorageDirView allocateBlock(long sessionId, long blockSize,
      BlockStoreLocation location, boolean skipReview) {
 
​    if (location.equals(BlockStoreLocation.anyTier())) {
      for (StorageTierView tierView : mMetadataView.getTierViews()) {
        candidateDirView = getCandidateDirInTier(tierView, blockSize, BlockStoreLocation.ANY_MEDIUM);
        if (candidateDirView != null) { // Review
          if (skipReview || mReviewer.acceptAllocation(candidateDirView)) {
            break;
          }
        }
      }
    }
     //…
    return candidateDirView;
  }
 
 
​private StorageDirView getCandidateDirInTier(StorageTierView tierView,
      long blockSize, String mediumType) {
    StorageDirView candidateDirView = null;
    long maxFreeBytes = blockSize - 1;
    for (StorageDirView dirView : tierView.getDirViews()) {
      if ((mediumType.equals(BlockStoreLocation.ANY_MEDIUM)
          || dirView.getMediumType().equals(mediumType))
          && dirView.getAvailableBytes() > maxFreeBytes) {
        maxFreeBytes = dirView.getAvailableBytes();
        candidateDirView = dirView;
      }
    }
    return candidateDirView;
  }

As you can see from the source code, after finding the candidateDirView, it needs to go through a review process to finally decide which storage directory to return. Let’s take a closer look at the Review policies next.

Block Allocation Review Policies

Alluxio uses block allocation review policies to complement allocation policies. Review brings additional restrictions (e.g., SoftLimit/HardLimit) and randomness to the block allocation policy. The Reviewer works together with the Allocator.

Currently, Alluxio implements two Review policies.

  1. alluxio.worker.block.reviewer.AcceptingReviewer: This reviewer accepts every block allocation, which is equivalent to no review.
  2. alluxio.worker.block.reviewer.ProbabilisticBufferReviewer: Based on the available space in each storage directory, rejects the attempts to put new blocks into this directory probabilistically. This is the default behavior.

The client writes a block with a stream. So, instead of requesting the whole block size from the worker storage, the client requests one buffer size at a time until the stream is completed. This is because, in many cases, the real block size is smaller than a block size limit (64MB by default). Therefore, instead of allocating the whole 64MB, we start small and let the blocks expand on the way.

One problem is that when the storage dir has less than 64MB left, a new block may be allocated to this directory because the initial request is smaller than the vacancy. As the block expands, the vacancy will be filled up, and one other block has to be evicted from the directory so that this block can finish writing. This brings in additional evictions, and the new block should have been allocated to some other directories.

This is why ProbabilisticBufferReviewer was introduced. As the vacancy shrinks in a directory, the directory will have a higher chance of rejecting new blocks, so that existing blocks may expand without evicting others.

Let’s take a look at the source code of the default ProbabilisticBufferReviewer.

public class ProbabilisticBufferReviewer implements Reviewer {
    //…
    double getProbability(StorageDirView dirView) {
        //…
        if (availableBytes > mSoftLimitBytes) {
          return 1.0;
        }
        if (availableBytes <= mHardLimitBytes) {
          return 0.0;
        }
 
        double x = capacityBytes - availableBytes;
        double k = 1.0 / (mHardLimitBytes - mSoftLimitBytes); // If HardLimit = SoftLimit, then we would have returned in the previous if-else
        double b = (capacityBytes - mHardLimitBytes + 0.0) / (mSoftLimitBytes - mHardLimitBytes);
        double y = k * x + b;
        return y;
      }
}

ProbabilisticBufferReviewer

  • If the remaining space in the current storage directory is less than mHardLimitBytes, then it will return 0, indicating that it did not pass the review.
  • If the remaining space in the current storage directory is larger than mSoftLimitBytes, then it will return 1, which means it passed the review.
  • If the remaining space size is between mHardLimitBytes and mSoftLimitBytes, then a value between (0, 1) is returned.

freeSpace

As mentioned above, allocateSpace supports the eviction of blocks to free up space in a storage directory to accommodate new data. How does Alluxio decide which block to evict?

Block eviction is done by freeSpace. If there is not enough space, the block will be evicted one by one until space is freed up.

public synchronized void freeSpace(long sessionId, long minContiguousBytes,
      long minAvailableBytes, BlockStoreLocation location) {
    Iterator<Long> evictionCandidates = mBlockIterator.getIterator(location, BlockOrder.NATURAL);
    while (true) {
      //…
      if (contiguousSpaceFound && availableBytesFound) {
        break;
      }
 
​      if (!evictionCandidates.hasNext()) {
        break;
      }
      long blockToDelete = evictionCandidates.next();
      if (evictorView.isBlockEvictable(blockToDelete)) { // some blocks will not be evicted
        try {
          BlockMeta blockMeta = mMetaManager.getBlockMeta(blockToDelete);
          removeBlockFileAndMeta(blockMeta);
          //…
      }
    //…
  }

There are some blocks that will not be evicted.

public boolean isBlockEvictable(long blockId) {
    boolean pinned = isBlockPinned(blockId);
    boolean locked = isBlockLocked(blockId);
    boolean marked = isBlockMarked(blockId);
    boolean isEvictable = !pinned && !locked && !marked;
    if (!isEvictable) {
      LOG.debug("Block not evictable: {}. Pinned: {}, Locked: {}, Marked: {}", blockId, pinned,
          locked, marked);
    }​
 
    return isEvictable;
  }

There are currently two policies for eviction.

  • alluxio.worker.block.annotator.LRUAnnotator: Annotates the blocks based on least-recently-used (LRU) order. This is Alluxio’s default annotator.
  • alluxio.worker.block.annotator.LRFUAnnotator: Annotates the blocks based on least-recently-used (LRU) and least-frequently-used (LFU) orders with a configurable weight. The weight controls if the behavior is closer to LRU or LFU.

The default policy is LRUAnnotator, which can be configured with the property alluxio.worker.block.annotator.class.

Let’s take a look at the source code of LRUAnnotator.

public class LRUAnnotator implements BlockAnnotator<LRUAnnotator.LRUSortedField> {
  private static final Logger LOG = LoggerFactory.getLogger(LRUAnnotator.class);​
 
  private AtomicLong mLRUClock;
 
​  @Override
  public BlockSortedField updateSortedField(long blockId, LRUSortedField oldValue) {
    long clockValue = mLRUClock.incrementAndGet();
    return new LRUSortedField(clockValue);
  }
 
​  /**
   * Sorted-field for LRU.
   */
  protected class LRUSortedField implements BlockSortedField {
    private Long mClockValue;
 
​    private LRUSortedField(long clockValue) {
      mClockValue = clockValue;
    }​
 
    @Override
    public int compareTo(BlockSortedField o) {
      Preconditions.checkState(o instanceof LRUSortedField);
      return mClockValue.compareTo(((LRUSortedField) o).mClockValue);
    }
    //…
}

We can see that LRUAnnotator internally identifies the access order of each block by a timestamp represented by a monotonically increasing AtomicLong. The larger the AtomicLong is, the more recent it is accessed.

About the Author

Xi Chen is one of the Top 100 contributors to the Alluxio open source project.