Parallel scanning with AWS DynamoDB

Mark Boyd
3 min read · Nov 3, 2021

Background

There are two ways to retrieve data from a DynamoDB table: querying and scanning. Querying involves specifying values for the partition key (and optionally the sort key) of the table in order to retrieve the exact record(s) you want. However, if you want to retrieve records by some other property that is not the partition/sort key, you can do a scan operation. A scan essentially iterates over all of the records in your table and returns the ones that match your specified filters.

Querying operates on the key fields that DynamoDB indexes, so it is very efficient, whereas scanning can be very costly (in direct proportion to your table size) and time-consuming. The recommendation, then, is to use scanning only when querying is not an option.
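To make the distinction concrete, here is a minimal sketch of both operations using the AWS SDK for JavaScript v3. The "orders" table, its "customerId" partition key, and the attribute names are all hypothetical:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  QueryCommand,
  ScanCommand,
} from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function main(): Promise<void> {
  // Query: a keyed lookup against the partition key; DynamoDB reads
  // only the matching items.
  const queried = await client.send(new QueryCommand({
    TableName: "orders",
    KeyConditionExpression: "customerId = :id",
    ExpressionAttributeValues: { ":id": "customer-123" },
  }));
  console.log("query matched:", queried.Items?.length);

  // Scan: reads every item in the table; the filter is applied after
  // the read, so you pay read capacity for everything scanned, not
  // just what is returned.
  const scanned = await client.send(new ScanCommand({
    TableName: "orders",
    FilterExpression: "orderStatus = :s",
    ExpressionAttributeValues: { ":s": "SHIPPED" },
  }));
  console.log("scan matched:", scanned.Items?.length);
}

main().catch(console.error);
```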

For further information, see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-query-scan.html.

Optimizing full table scans with parallel scanning

Despite the inefficiency of DynamoDB table scanning, sometimes you have no choice. Recently I needed to write a script that would read all records from a DynamoDB table so that they could be migrated to a PostgreSQL RDS database, thus necessitating a table scan.

While the financial costs associated with scanning an entire table (measured in read units) cannot be avoided with such an approach, I was desperate to see if I could improve the total processing time necessary to read an entire DynamoDB table by somehow parallelizing the table reads.

It turns out that DynamoDB has a feature to support parallel scanning of tables. Essentially, parallel scanning allows you to break a full table scan into a specified number of segments, where each segment represents a block of items in the table. You can then scan all of the segments in parallel to reduce the time needed to scan the entire table.

The AWS announcement for parallel scanning mentions the new parameters to the Scan API request that are required to use the feature:

- TotalSegments denotes the number of workers that will access the table concurrently.

- Segment denotes the segment of the table to be accessed by the calling worker.

And the announcement gives a nice overview of how parallel scanning might look in practice:

Let’s say you have 4 workers. You would issue the following calls simultaneously to initiate a parallel scan:

- Scan(TotalSegments=4, Segment=0, …)

- Scan(TotalSegments=4, Segment=1, …)

- Scan(TotalSegments=4, Segment=2, …)

- Scan(TotalSegments=4, Segment=3, …)

Importantly, it is up to you, as the initiator of the parallel scan, to run the segment scans via whatever paradigm your language supports for doing operations in parallel (threads, promises, etc.).
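As an illustration (a sketch, not the announcement's code), here is how those four concurrent segment scans might look in Node.js with promises, using the AWS SDK for JavaScript v3; the table name is a placeholder:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TOTAL_SEGMENTS = 4;

async function parallelScan(tableName: string) {
  // One Scan call per segment; Promise.all runs them concurrently.
  const pages = await Promise.all(
    Array.from({ length: TOTAL_SEGMENTS }, (_, segment) =>
      client.send(new ScanCommand({
        TableName: tableName,
        TotalSegments: TOTAL_SEGMENTS,
        Segment: segment,
      }))
    )
  );
  // Flatten the per-segment results into a single list of items.
  return pages.flatMap((page) => page.Items ?? []);
}
```

Note that this sketch only retrieves the first page of each segment; handling full pagination is covered below.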

There is minimal guidance on the optimal value to use for "TotalSegments" when doing parallel scans, with AWS recommending one scan segment per 2 GB of data as a safe starting point. It is hard to be more precise given that the constraints depend mostly on things controlled by the user (available memory, etc.). I used their recommendation as a starting point, but after some manual testing I ultimately ended up being much more aggressive than that, using 50 parallel scan segments on a 4.2 GB table.
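As a rough illustration of that 2 GB heuristic, here is a hypothetical helper (not part of my migration code) that derives a starting TotalSegments value from the table's reported size. DescribeTable's TableSizeBytes is only updated periodically, so treat the result as a coarse starting point:

```typescript
import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb";

const TWO_GB = 2 * 1024 ** 3;

// Suggest one scan segment per ~2 GB of table data, with a floor of 1.
async function suggestTotalSegments(tableName: string): Promise<number> {
  const client = new DynamoDBClient({});
  const { Table } = await client.send(
    new DescribeTableCommand({ TableName: tableName })
  );
  const sizeBytes = Table?.TableSizeBytes ?? 0;
  return Math.max(1, Math.ceil(sizeBytes / TWO_GB));
}
```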

Reference implementation for Node.js

Here is the code I wrote to do parallel scanning for the Cumulus project:

https://github.com/nasa/cumulus/blob/59fc0ecf07a4c58cd65ff2a730aeb864c90343d7/packages/aws-client/src/DynamoDb.ts#L162

Keen observers may notice that there is a do/while loop in this code to handle fetching all of the items within a parallel scan segment. This is an important point: parallel scanning divides the read of your table into the specified number of segments, but depending on your table size, you may not be able to retrieve all the items for a given segment in a single Scan request. A single Scan request can return a maximum of 1 MB of data, so if an individual scan segment exceeds 1 MB, you will need to do multiple Scan operations on that segment to retrieve all of its items.
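The Cumulus code linked above is the real implementation; as a simplified sketch of the same pattern, here is what draining a single segment looks like with the AWS SDK for JavaScript v3, looping on LastEvaluatedKey until the segment is exhausted:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function scanSegment(
  tableName: string,
  segment: number,
  totalSegments: number
): Promise<Record<string, any>[]> {
  const items: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;

  do {
    // Each Scan returns at most 1 MB of data; ExclusiveStartKey
    // resumes from where the previous page left off.
    const page = await client.send(new ScanCommand({
      TableName: tableName,
      Segment: segment,
      TotalSegments: totalSegments,
      ExclusiveStartKey: lastKey,
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (lastKey); // undefined once the segment is exhausted

  return items;
}
```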

Additional considerations for record migration

In my case, I was using parallel scanning simply as a means to reduce the amount of time necessary to run my record migration scripts. But the side effect of doing so much parallel scanning and migrating so many records in parallel was that it dramatically increased the load on our serverless RDS database, causing ACU spikes and some record migration failures.

The solution in my case was to tweak the parallelization to achieve the optimum performance without overtaxing our RDS database, but the key takeaway is to be mindful of how the scale of your parallelized reads will affect other systems depending on what you are doing with those records.
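One simple way to implement that kind of throttling (a hypothetical sketch, not the approach used in Cumulus) is to process the scan segments in fixed-size batches rather than launching them all at once. This reuses the scanSegment helper sketched earlier; migrateItems stands in for whatever writes the records to RDS:

```typescript
// Cap downstream load by keeping at most `concurrency` segments in
// flight at any one time.
async function migrateInBatches(
  tableName: string,
  totalSegments: number,
  concurrency: number,
  migrateItems: (items: Record<string, any>[]) => Promise<void>
): Promise<void> {
  const segments = Array.from({ length: totalSegments }, (_, i) => i);

  for (let i = 0; i < segments.length; i += concurrency) {
    const batch = segments.slice(i, i + concurrency);
    // Wait for this batch to finish before starting the next one.
    await Promise.all(
      batch.map(async (segment) => {
        const items = await scanSegment(tableName, segment, totalSegments);
        await migrateItems(items);
      })
    );
  }
}
```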

Testing & results

In my testing, using 50 parallel scan segments allowed me to read and migrate ~900,000 DynamoDB records in ~45 minutes.
