Example data: Cumulus granules/files data
Recently on Cumulus, we wanted to re-examine how we're storing our granules/files data. For context, on Cumulus a granule is a set of related data science files, so granules are composed of files. The granules and files are stored in separate tables, and granules have a one-to-many relationship with files.
Cumulus currently uses DynamoDB, a NoSQL database, to store this data, and as a result there is no referential integrity between the file records and the granule record they belong to. This means that when you write a file record to DynamoDB, nothing at the database level enforces that it refers to a valid granule record (usually by that record's ID).
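To make this concrete, here is a minimal sketch (with made-up table and attribute names, using plain dicts to stand in for DynamoDB tables) of how an orphaned file record can appear: the write is accepted even though the granule it references was never created.

```python
granules_table = {}  # stands in for the DynamoDB "granules" table
files_table = {}     # stands in for the DynamoDB "files" table

def put_file(file_id, granule_id, key):
    # DynamoDB's PutItem accepts this unconditionally -- it never checks
    # the granules table for the referenced granule_id.
    files_table[file_id] = {"fileId": file_id, "granuleId": granule_id, "key": key}

# This write succeeds even though no granule "g-123" was ever created:
put_file("f-1", "g-123", "data/f-1.hdf")
orphan = files_table["f-1"]["granuleId"] not in granules_table
print(orphan)  # True: an orphaned file record
```

Catching records like this requires application-level checks (or transactions, discussed below); the database itself will never complain.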
DynamoDB vs RDS
While my first instinct for storing interrelated data is to use RDS/SQL databases, you can model relational data in DynamoDB using a technique called "global secondary index (GSI) overloading". See the section titled "GSI overloading" in this article for an explanation of this technique.
Comparing DynamoDB using GSI overloading to using RDS, these were some takeaways:
- Using GSI overloading/DynamoDB requires a very good understanding, up front, of the queries you will run against your data, so that you can design your model/indexes to handle them efficiently
- DynamoDB has some support for transactions (e.g. writing related records to the database at the same time in a transaction that succeeds or fails as a whole) that you can use to help maintain data integrity, but there is a limit on how many items can be included in a transaction
- RDS has referential integrity of data by default and has robust transaction support
- DynamoDB, being part of the “serverless” suite of tools, will scale with your data by default
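On the transactions point: one common pattern for approximating referential integrity is to pair a `ConditionCheck` on the parent record with the `Put` of the child record in a single transaction. Here is a sketch (with hypothetical table and key names) of the request shape you could pass to boto3's `client.transact_write_items`; only the payload is built here, no AWS call is made.

```python
def build_file_write(granule_id, file_id):
    """Build a TransactWriteItems request that writes a file record only
    if its parent granule record exists."""
    return {
        "TransactItems": [
            {
                # Fail the whole transaction unless the granule is present.
                "ConditionCheck": {
                    "TableName": "granules",
                    "Key": {"granuleId": {"S": granule_id}},
                    "ConditionExpression": "attribute_exists(granuleId)",
                }
            },
            {
                "Put": {
                    "TableName": "files",
                    "Item": {
                        "fileId": {"S": file_id},
                        "granuleId": {"S": granule_id},
                    },
                }
            },
        ]
    }

request = build_file_write("g-1", "f-1")
# Note: DynamoDB caps the number of items per transaction, so a granule
# with many files may not fit its writes into a single transaction.
```

This buys integrity at write time, but unlike a foreign key it does nothing to stop someone later deleting the granule out from under its files.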
The biggest takeaway of using DynamoDB vs RDS for relational data is:
DynamoDB can handle/mimic relational data queries when the queries are predictable, but it is not at all the right solution for ad-hoc/unpredictable queries
In other words, the efficiency of using DynamoDB/GSI overloading depends on your queries being well understood up front so that the model/indexes can be designed to accommodate them. However, if you throw a query at DynamoDB that it was not designed to handle well, it may require a table scan and be very slow/costly. With RDS/SQL, the schema and data entity attributes are well defined, so supporting a new query on a given field can usually be achieved efficiently and without much difficulty (e.g. by adding a new index).
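Both halves of that contrast can be shown in a few lines of SQL. This example uses SQLite (via Python's standard library) as a stand-in for an RDS engine like Postgres, with made-up table/column names: the foreign key rejects an orphaned file record outright, and a new query pattern is supported after the fact by simply adding an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires FKs to be enabled
conn.execute("CREATE TABLE granules (granule_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE files (
        file_id TEXT PRIMARY KEY,
        granule_id TEXT NOT NULL REFERENCES granules(granule_id),
        bucket TEXT
    )
""")

conn.execute("INSERT INTO granules VALUES ('g-1')")
conn.execute("INSERT INTO files VALUES ('f-1', 'g-1', 'protected')")

# Unlike DynamoDB, writing a file that references a missing granule fails:
try:
    conn.execute("INSERT INTO files VALUES ('f-2', 'nope', 'public')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True

# Supporting a new query pattern later is just an index away:
conn.execute("CREATE INDEX idx_files_bucket ON files (bucket)")
rows = conn.execute(
    "SELECT file_id FROM files WHERE bucket = 'protected'"
).fetchall()
print(rows)  # [('f-1',)]
```

Neither the rejection nor the new index required anticipating these access patterns when the schema was first designed, which is the core of the trade-off described above.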