The right way to implement a mutex with DynamoDB

Temps de lecture : 3 minutes

Last week, I reviewed a pull request from one of my colleagues and found a totally wrong implementation of a mutex lock based on DynamoDB. After pointing it out, I realized there are very few blog posts about the correct implementation of a mutex lock using Amazon DynamoDB, and I personally believe it is a really common use case to share.

In this blog post, I will present the “wrong” implementation, and the actual correct code we have written together during the review.

A bit of theory: optimistic lock vs mutex lock

Optimistic locks are useful when you would like to update a data, assuming that it could have been updated by somebody else at the same time. Basically, optimistic lock work as this:

  1. Record a timestamp marking the transaction’s beginning
  2. Read values, and tentatively write changes
  3. Check whether other transactions have modified data
  4. If there is no conflict, make all changes take effect. If there is a conflict, resolve it.

Mutex locks are intended for MUTual EXclusion. For example, when you have two processes updating the same resource, you can have a single bit telling a process that another is currently updating the resource. Mutex locks work as it:

  1. Read the semaphore value. If value is 0, set to 1, else, fail (lock cannot be acquired)
  2. When lock is acquired, update the data
  3. Set semaphore value to 0 (will release lock)

Although the two locks are similar, we will below present the mutual exclusion algorithm.

An incorrect mutex lock algorithm

Assuming we have a DynamoDB table with ResourceId as hash key, below is the source code I received to review:

import boto3

def lock(my_resource_id):
    dyn_table = boto3.resource('dynamodb').Table('my-lock-table')
    lock = dyn_table.get_item(Key={
        'ResourceId': my_resource_id
    }).get('Item', False)

    if not lock:
        dyn_table.put_item(Item={
            'ResourceId': my_resource_id
        })
        return True
    else:
        return False

What is wrong here ?

First of all, the « lock » operation is here not atomic. Between a get_item and the put_item operation, there could have been another process checking the lock using get_item on his side. Example in workflow below, the two processes will believe to have acquired the lock:

The second problem is intrinsic to DynamoDB databases. Even with one process this algorithm will not behave as expected. The reason is: the get_item operation is eventually consistent. Consequently, it may return a stale item when the lock was actually released, and the process will believe that it cannot acquire the lock.

The mutex lock done right using DynamoDB as backend

Below, is my proposed implementation. It uses the DynamoDB ConditionExpression feature.

import boto3
import botocore.exceptions

def lock(my_resource_id):
    try:
        # Put item with conditional expression to acquire the lock
        dyn_table = boto3.resource('dynamodb').Table('my-lock-table')
        dyn_table.put_item(
            Item={'ResourceId': my_resource_id},
            ConditionExpression="attribute_not_exists(#r)",
            ExpressionAttributeNames={"#r": "ResourceId"})
        # Lock acquired
        return True
    except botocore.exceptions.ClientError as e:
        # Another exception than ConditionalCheckFailedException was caught, raise as-is
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
        else:
            # Else, lock cannot be acquired because already locked
            return False

How does it work ?

Using put_item with ConditionExpression will tell DynamoDB to perform the put operation ONLY IF the condition is true. This operation is atomic.

You can take a look at the ConditionExpression documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.ConditionExpressions.html

In the above source code, I tell DynamoDB to put my item only if there is not already an item with this identifier. If the item exists, an exception is raised and lock is not acquired : probably another process is using the resource. If it worked, the lock is acquired and the item is created, preventing another process to take it.

The unlock works symmetrically, verifying that the resource is currently locked.

The end ?

Developing algorithms with distributed databases is always a challenge. Moreover, because Amazon DynamoDB is so simple to use, we can easily forget about the eventually consistency or basic algorithmic concepts about atomic operations.

If you are reading this article because a colleague pointed something in your source code, do not be mad. I personally love to be wrong, because when somebody have explained me why I mistaken, I have learnt something new.

Commentaires :

A lire également sur le sujet :