Azure Cosmos DB is Microsoft’s hot new managed database solution. It promises predictable performance and turnkey global distribution on top of a laundry list of impressive metrics backed by Service Level Agreements (SLAs). The “managed” part is very important. It means you don’t have to manage virtual machines, freeing you from tasks like provisioning, applying security patches, and scripting out software upgrades and complex scaling strategies. In other words, it raises the magic carpet you stand on when you’re coding to an even higher level of abstraction (into the clouds if you will), sweeping all of this undifferentiated heavy lifting into the purview of the cloud provider. As a developer, this means you get to focus on business priorities and delivering value to stakeholders.
My team was managing a NoSQL database running on a few beefy virtual machines in Azure. In fact, our entire stack was originally deployed across Azure VMs, placing us at the tier of abstraction commonly referred to as Infrastructure as a Service (IaaS). We had been shouldering the overhead of those day-to-day operations for over a year, fielding incoming feature requests with one hand while holding up the infrastructure with the other.
Earlier this year, we decided to make the move to managed services for our entire stack and landed on Cosmos DB for our target database. In addition to being managed, its built-in monitoring and automatic indexing of all document fields were appealing features. It also exposes a MongoDB API (one of five APIs it supports) that closely mirrors the interface of the NoSQL database we were moving away from. Last week we finished our migration, so now we’re 100% on the Cosmos DB train in production.
Going into this migration, we knew it wouldn’t be all butterflies and rainbows; Cosmos DB wasn’t going to solve everything. Even so, we hit a few unexpected snags that didn’t come to light during our initial research and prototyping phase. Here are 5 challenges we had to overcome during our migration:
1. Collections “as code” is not supported
Infrastructure as Code (IaC) is the idea that infrastructure can be represented in code and benefit from the same version control practices as regular software. If a server goes down, you’re not stuck up the creek without a paddle. You can stand up a new server and apply your declarative specifications to bring it to the correct state. IaC is a current best practice, and we strive to adhere to it on our project.
In Azure, IaC often comes in the form of Azure Resource Manager (ARM) templates. While you can deploy the Cosmos DB account via an ARM template, there is no way to specify the collections as code. This can lead to interesting problems in deployment pipelines.
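For what it’s worth, the account itself is easy to express in a template. Here’s a trimmed-down sketch of a MongoDB-flavored account resource (the parameter name is a placeholder, and the apiVersion may need updating for your subscription):

```json
{
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "apiVersion": "2015-04-08",
  "name": "[parameters('cosmosAccountName')]",
  "location": "[resourceGroup().location]",
  "kind": "MongoDB",
  "properties": {
    "databaseAccountOfferType": "Standard",
    "consistencyPolicy": { "defaultConsistencyLevel": "Session" },
    "locations": [
      { "locationName": "[resourceGroup().location]", "failoverPriority": 0 }
    ]
  }
}
```

But that’s where the template ends: the databases and collections inside the account have to be created some other way.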
For instance, after the template deployment step, you have to fall back to a tool like the Azure CLI to configure the collections within your Cosmos DB account. This proved complex for us because our Cosmos DB firewall policy has IP whitelisting enabled, and our VSTS build agent’s IP address is always changing. To solve this in our deployment pipeline, we had to dynamically whitelist the build agent’s IP, configure the collections, and then remove the build agent’s IP address from the whitelist.
Ultimately, we achieved IaC by coming up with our own JSON representation of collections and writing an idempotent script to establish them using the Azure CLI. This works, but it’s more complicated than a first-class ARM solution would be. Despite much demand from the community, Microsoft seems content to leave collections out of the ARM templates, so this is something we’ll have to live with for the foreseeable future.
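To give a flavor of the workaround, here’s a minimal sketch of that script. The JSON shape and resource names are simplified, and the ‘az cosmosdb collection’ flag spellings reflect the older CLI commands, so double-check them against the CLI version you have installed:

```bash
#!/usr/bin/env bash
# Sketch: create Cosmos DB collections from a version-controlled JSON spec.
# collections.json looks like: [{ "db": "myapp", "name": "responses", "throughput": 10000 }, ...]
# (names and flag spellings are illustrative; verify against your CLI version)
set -euo pipefail

ACCOUNT="my-cosmos-account"        # placeholder
RESOURCE_GROUP="my-resource-group" # placeholder

jq -c '.[]' collections.json | while read -r spec; do
  db=$(jq -r '.db' <<< "$spec")
  name=$(jq -r '.name' <<< "$spec")
  throughput=$(jq -r '.throughput' <<< "$spec")

  # The create command fails if the collection already exists, so check first
  # to keep the script idempotent (database creation handled the same way, omitted here).
  if ! az cosmosdb collection exists --resource-group-name "$RESOURCE_GROUP" \
        --name "$ACCOUNT" --db-name "$db" --collection-name "$name" --output tsv \
        | grep -iq true; then
    az cosmosdb collection create --resource-group-name "$RESOURCE_GROUP" \
      --name "$ACCOUNT" --db-name "$db" --collection-name "$name" \
      --throughput "$throughput"
  fi
done
```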
2. Firewall updates don’t immediately take effect
With Cosmos DB’s firewall settings, you can whitelist the public internet, all of Azure, or specific IP addresses. In all cases, a client still needs credentials to successfully connect, but the firewall is an added layer of defense that can prevent most hackers from even making it that far. Given our security requirements, we chose to whitelist specific IPs.
As mentioned above, we found it necessary to dynamically add the IP address of the build agent in our deployment pipeline to enable it to configure the Cosmos DB collections. After the build agent finishes this task, we remove its IP address from the whitelist.
Additionally, we deploy an Azure Search resource alongside our Cosmos DB account. Later in our pipeline, when we remove the build agent’s IP address, we also add the IP of our Azure Search account so that our search indexers can successfully crawl over our Cosmos DB collections.
Unfortunately, after updating the firewall policy, there is an indeterminate delay before the new configuration actually takes effect, and it seems to take longer for Azure Search to get whitelisted than for our build agent. As a result, we found it necessary to inject sleep statements of 6 minutes after whitelisting the build agent and 8 minutes after whitelisting the search account before proceeding further. Otherwise, the subsequent operations of configuring collections or creating the search indexer would fail because the Cosmos DB firewall would block access.
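In pipeline terms, the whole dance ends up looking something like this sketch (the IP variables and resource names are placeholders):

```bash
# Sketch of the firewall shuffle in our release pipeline (variable values are placeholders).
AGENT_IP=$(curl -s https://api.ipify.org)   # one way to discover the build agent's public IP

# 1. Add the build agent to the whitelist alongside the permanently allowed IPs...
az cosmosdb update --resource-group "$RESOURCE_GROUP" --name "$ACCOUNT" \
  --ip-range-filter "$PERMANENT_IPS,$AGENT_IP"
sleep 360   # ...then wait ~6 minutes for the new firewall rules to actually take effect.

# 2. Configure collections, unique keys, etc. (see the script from section 1).

# 3. Swap the agent's IP for the Azure Search service's IP...
az cosmosdb update --resource-group "$RESOURCE_GROUP" --name "$ACCOUNT" \
  --ip-range-filter "$PERMANENT_IPS,$SEARCH_IP"
sleep 480   # ...and wait ~8 minutes before creating the search indexer.
```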
3. You are at the mercy of your busiest partition
We thought that if we provisioned a collection with 50k request units (RUs; one RU roughly corresponds to the cost of reading a 1 KB document, and throughput is provisioned in RUs per second) and partitioned it, then we would be covered as long as the total throughput consumed across all partitions stayed under 50k RUs. This is not the case.
We found out the hard way that, in this scenario, Cosmos DB creates 5 physical partitions and gives each one 10k RUs of the provisioned throughput. If any single partition exceeded its 10k RU share, that partition would get rate limited until we scaled up the entire collection’s throughput. This hurt because the other partitions were seeing less traffic and didn’t need to be scaled up. When considering whether to partition your collection, try to come up with a shard key that will spread traffic evenly across your partitions.
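To make that concrete, here’s a sketch of provisioning such a collection and the math behind the throttling (the names, the shard key, and the flag spellings are all illustrative):

```bash
# A partitioned collection provisioned with 50k RUs (placeholder names; flag
# spellings follow the older 'az cosmosdb collection' commands).
az cosmosdb collection create --resource-group-name "$RESOURCE_GROUP" --name "$ACCOUNT" \
  --db-name myapp --collection-name events \
  --partition-key-path "/customerId" \
  --throughput 50000

# Behind the scenes, Cosmos DB carves this into 5 physical partitions:
#   50,000 RUs / 5 partitions = 10,000 RUs per partition
# A single hot partition gets throttled at 10k RUs even while the other four sit idle.
```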
4. Querying partitioned collections is limited
Partitioned collections kept on delivering in the surprises category. It turns out, Cosmos DB restricts the types of queries you’re allowed to do on partitioned collections. We had a partitioned collection called ‘responses’ that I was trying to empty in a lower environment. Here’s what happened:
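In essence (the command below is a sketch, and the error text is paraphrased rather than a verbatim log):

```bash
# Trying to empty the partitioned 'responses' collection through the Mongo API
# (connection string elided).
mongo "$COSMOS_CONNECTION_STRING" --eval 'db.responses.remove({})'
# The empty filter doesn't pin down a shard key value, so Cosmos DB rejects the
# command with an error along the lines of:
#   "query in command must target a single shard key"
```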
In pure MongoDB, this would not have been an issue. And this error continued to plague us later when doing count queries and upserts on partitioned collections. As a result, we’ve decided that partitioned collections should be the exception, not the rule. They can still be useful when trying to optimize performance, but be sure to weigh the benefits against the costs outlined here. Other developers have surfaced this issue in the Cosmos DB forum, and Microsoft has marked it as planned work. With the SQL API, you can bypass this restriction simply by setting a flag to enable cross-partition queries. The Mongo API needs a similar flag, or for this to just work out of the box.
5. Time-To-Live (TTL) and unique indexes vary slightly from their MongoDB counterparts
Microsoft is very transparent about the subset of the Mongo API implemented by Cosmos DB, which is outlined in its documentation. Nevertheless, some things may still catch you off guard. Be sure to try things out rather than take them for granted. For instance, we learned that Time-To-Live (TTL) collections are supported by Cosmos DB in a limited fashion. In Mongo, you can create a TTL index on any field in a document and specify an “expireAfterSeconds” option to invalidate the document some number of seconds after the timestamp value held in the indexed field. This is much more flexible than the Cosmos DB implementation, which originally only allowed you to specify a TTL at the collection level, keyed off the hidden “_ts” field that tracks when a document was last modified. Recently, Cosmos DB has added support for per-document TTL, but this is a preview feature and requires each document to have a “ttl” key.
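To make the difference concrete, here’s a quick sketch; the collection and field names are made up for illustration:

```bash
# In MongoDB, a TTL index can live on any timestamp field; documents expire an
# hour after the value in 'lastAccessedAt' ('sessions' and 'lastAccessedAt' are
# made-up names). Against vanilla Mongo this just works:
mongo "$MONGO_CONNECTION_STRING" --eval \
  'db.sessions.createIndex({ "lastAccessedAt": 1 }, { "expireAfterSeconds": 3600 })'

# Cosmos DB (at the time) had no per-field TTL indexes. Instead you set a single
# default TTL on the whole collection, measured from the system-maintained _ts
# (last-modified) timestamp rather than a field of your choosing.
```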
Another mismatch surfaced when working with Unique Indexes. In Cosmos, a unique index can only be created on an empty collection. As a result, unique indexes are something you’ll want to think about up front to avoid costly nuke and pave scenarios later.
As an aside, the Azure CLI for Cosmos DB (version 0.2.3 at time of writing) doesn’t expose a flag for specifying unique key paths on collections. To configure unique keys in our deployment pipeline, we dipped down into the Cosmos DB REST API. The JSON key for specifying unique key paths is not documented, but the functionality is there. Here’s an example payload used to create a collection called ‘books’ with a unique constraint on the ‘id’ field:
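The body below gets POSTed to the account’s ‘/dbs/{database}/colls’ endpoint (authentication headers and the other collection settings are omitted); the ‘uniqueKeyPolicy’ block is the piece that isn’t documented:

```json
{
  "id": "books",
  "uniqueKeyPolicy": {
    "uniqueKeys": [
      { "paths": [ "/id" ] }
    ]
  }
}
```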
Despite these nuances, Cosmos DB has satisfied our use case of migrating from an unmanaged NoSQL database deployed across several Azure VMs to a fully managed database. We’ve really enjoyed the transparency offered by the real-time monitoring features, which have allowed us to understand our traffic patterns and allocate request units accordingly. Best of all, we haven’t had to SSH into anything to troubleshoot issues or perform updates! Looking back, even in light of the challenges we’ve encountered thus far, I think we made the right decision to move to Cosmos DB. Hopefully knowing these 5 things will help your migration go more smoothly.