Note: I originally wrote this post on another blog in June 2018, then moved it over here.
A lot of the work I’ve done for the past two months has revolved around one huge DynamoDB table and a few smaller tables complementing the huge one. I thought I’d share some of what I learned in the process.
I’m not going to be talking about the basics of DynamoDB (eventual consistency, avoiding scans and hot partitions, keeping items under 1KB, etc.). For the most basic advice, you should check out the DynamoDB docs.
Spoiler alert: the main takeaway from this post will be the following.
“DynamoDB will not scale unless you know what you’re doing from the very beginning.”
A lot of people have mentioned this already 1 2 3. Hot partitions
are probably the biggest problem here, but I’m not going to talk about that
(yet).
Maximum query return size
Queries return up to 1MB of data (and this is before any filtering is done).
Now, imagine you’re building some kind of social network where people post photos. You’ll save the following data.
- Entity: Photo
  - UserId
  - Timestamp: moment when the photo was added to the database.
  - Thumbnail
  - PhotoLink: something to help you retrieve the actual photo stored in S3.
  - Likes: number of likes/shares/upvotes/whatever
  - Title
  - Description
And your most important queries will be to get the last few photos by a
specific user.
Now, if you don’t know what you’re doing and decide to believe in the allegedly seamless scalability and ease of use of DynamoDB, you may consider having a single DynamoDB table with UserId as the partition key and Timestamp as the sort key, and having everything else as attributes. That way you can make a Query with Limit=K and ScanIndexForward=false to get the last K photos of a user.
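As a minimal sketch of that query, here is how the request parameters could be built in Python. The table name Photos, the helper name, and the assumption that UserId is stored as a string are mine, not part of the original design; the parameter names are those of the DynamoDB Query API.

```python
def latest_photos_query(user_id: str, k: int, table_name: str = "Photos") -> dict:
    """Build keyword arguments for a Query returning a user's K newest photos."""
    return {
        "TableName": table_name,  # hypothetical table name
        "KeyConditionExpression": "UserId = :uid",
        # Assumes UserId is stored as a string ("S") attribute.
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        # Descending sort-key order, so the newest Timestamp comes first.
        "ScanIndexForward": False,
        # Stop after evaluating K items.
        "Limit": k,
    }
```

With boto3 installed and credentials configured, this would be used as boto3.client("dynamodb").query(**latest_photos_query("alice", 10)).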
Of course, DynamoDB will overwrite any item with the same UserId and Timestamp, which is something you probably don’t want to happen. But you can fix that by appending a fixed-size random suffix to the Timestamp, or something of the sort. Let’s ignore that for the sake of simplicity.
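One possible shape for that suffix trick, purely as an illustration: store the sort key as a string made of a zero-padded millisecond timestamp plus a few random hex characters. The function name, the "#" separator, and the choice of a string-typed sort key are assumptions of this sketch.

```python
import secrets
import time
from typing import Optional


def unique_sort_key(timestamp_ms: Optional[int] = None, suffix_bytes: int = 4) -> str:
    """Return a sort key that orders by time but is unlikely to collide."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # Zero-pad to a fixed width so string order matches chronological order.
    # The suffix is fixed-size too: 2 hex characters per random byte.
    suffix = secrets.token_hex(suffix_bytes)
    return f"{timestamp_ms:013d}#{suffix}"
```

Because the timestamp part has a fixed width, a Query with ScanIndexForward=false still returns the newest photos first.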
Now suppose you also want to add a feature where users can see their most-liked photos. This is where the 1MB limit in the query return size kicks in. You could implement it by querying for all photos by a user and sorting them by Likes in memory. But if the number of photos by a user grows so that you have more than a few MB of data for a single user, you’ll have to fetch that data in several sequential requests of up to 1MB each, instead of a single query, and that will degrade the user experience significantly.
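To make the "several sequential requests" concrete, here is a rough sketch of the pagination loop. It takes the query function as an argument (for example the query method of a boto3 DynamoDB client) so it can be tried without AWS; the helper names are hypothetical, and Likes is assumed to be a number attribute in the low-level {"N": "..."} format.

```python
from typing import Any, Callable, Dict, List


def query_all_pages(query_fn: Callable[..., Dict[str, Any]],
                    params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Follow LastEvaluatedKey until the whole result set has been read."""
    items: List[Dict[str, Any]] = []
    request = dict(params)  # copy so the caller's dict is not mutated
    while True:
        page = query_fn(**request)  # one round trip, at most 1MB of data
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Each page can only be requested after the previous one returned.
        request["ExclusiveStartKey"] = last_key


def most_liked(items: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    """Sort in memory by Likes, highest first, and keep the top k."""
    return sorted(items, key=lambda item: int(item["Likes"]["N"]), reverse=True)[:k]
```

The loop is inherently sequential: the start key for page N+1 only arrives with page N, so the requests can’t be issued in parallel.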
In the case I’ve been working on, there was no noticeable difference between getting 0.1MB of data (i.e. setting a pretty low Limit) or 0.9MB of data in a query, but when we went over 1MB and had to do 2 sequential requests, the time pretty much doubled, as expected. This means sequential reads are a no-go if you’re striving for fast response times.
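A back-of-the-envelope model of that observation, assuming each request costs roughly the same regardless of how much of the 1MB it uses. The function names and the per-request latency argument are hypothetical, and I’m treating 1MB as 1024 * 1024 bytes.

```python
import math

QUERY_PAGE_LIMIT_BYTES = 1024 * 1024  # 1MB cap per Query response (assumed binary MB)


def sequential_requests(total_bytes: int) -> int:
    """Number of Query round trips needed to read total_bytes of data."""
    return max(1, math.ceil(total_bytes / QUERY_PAGE_LIMIT_BYTES))


def estimated_latency_ms(total_bytes: int, per_request_ms: float) -> float:
    """Rough total latency if every round trip takes about per_request_ms."""
    return sequential_requests(total_bytes) * per_request_ms
```

Under this model, 0.1MB and 0.9MB both cost one request, while anything just over 1MB costs two.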
Global Secondary Indexes are incredibly expensive
To do that query efficiently, you’ll have to add a Global Secondary Index with UserId as the partition key and Likes as the sort key, which essentially doubles the cost of your table.
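For illustration, this is roughly what such an index definition and the matching query could look like as request parameters. The index name UserLikesIndex, the table name Photos, and the capacity arguments are made up for this sketch, and it assumes the index projects all attributes, which is what makes writes and storage cost about as much as the table itself.

```python
def likes_index_definition(read_units: int, write_units: int) -> dict:
    """One entry for the GlobalSecondaryIndexes list of a CreateTable request."""
    return {
        "IndexName": "UserLikesIndex",  # hypothetical index name
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},   # partition key
            {"AttributeName": "Likes", "KeyType": "RANGE"},   # sort key
        ],
        # Copy every attribute into the index (assumption of this sketch).
        "Projection": {"ProjectionType": "ALL"},
        # The index has its own provisioned capacity, billed separately.
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def most_liked_query(user_id: str, k: int, table_name: str = "Photos") -> dict:
    """Query parameters for a user's K most-liked photos via the index."""
    return {
        "TableName": table_name,
        "IndexName": "UserLikesIndex",
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "ScanIndexForward": False,  # highest Likes first
        "Limit": k,
    }
```

Likes would also need to be declared as a number attribute in the table’s AttributeDefinitions for the index to be accepted.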
So why not add a local secondary index instead? First of all, that won’t save you a lot of money if your expenses come mostly from write capacity. But there’s something worse.
Item collections are capped at 10GB
If you’re not very aware of this restriction, you could end up in a world of
pain.
When you have a local secondary index, the collection of items sharing the same partition key value can’t exceed 10GB in total. Applied to this case, you can’t have more than 10GB of data for a single user (i.e. a collection of items with the same value in UserId).
In this case, it’s fairly unlikely normal users will ever reach that limit, but a bot posting every 30 seconds could get there quite fast (depending on the item sizes). And in other applications the limit could be very much within reach.
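To get a feel for "quite fast", here is a quick estimate of how long a bot posting at a fixed interval would take to fill 10GB. The item sizes are hypothetical, and I’m assuming 10GB means 10 * 1024^3 bytes and ignoring whatever the index itself adds to the collection.

```python
ITEM_COLLECTION_LIMIT_BYTES = 10 * 1024**3  # 10GB, assumed binary gigabytes
SECONDS_PER_DAY = 86_400


def days_until_limit(item_size_bytes: int, seconds_between_posts: float = 30.0) -> float:
    """Days until one user's items add up to the 10GB item collection cap."""
    items_needed = ITEM_COLLECTION_LIMIT_BYTES / item_size_bytes
    return items_needed * seconds_between_posts / SECONDS_PER_DAY
```

Under these assumptions, days_until_limit(100 * 1024) (100KB items, say with an embedded thumbnail) comes out at a bit over 36 days, while 1KB items would take roughly ten years.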
Conclusions
You might have thought that the entity we were trying to model (Photo) required simple enough queries that it would be trivial to model with DynamoDB, but you would’ve thought wrong. If you need queries that are any more complex than retrieving a single entry by key, you should not take the decision to use DynamoDB lightly.