
Introducing Redis and Azure Cognitive Search to the Umbraco Marketplace

Written by Andy Butland

This blog post is the first in a short series focussing on the introduction of new components to improve performance for the read-only API supporting the Umbraco Marketplace. These technologies were mostly new to us, and in this series you can read more about our findings and how we used them.

In the recent blog post discussing updates to the Umbraco Marketplace we mentioned in passing some "behind the scenes" updates we have made to improve the functionality and performance of the website and the API that provides its content.  We also wondered whether technically minded readers would be interested in hearing more about why we made these architectural choices and some of the implementation details of how we went about it.

So that's what we've put together in the following couple of articles.

To be clear, there's no need to follow these to use the Marketplace as a solution or package developer - they are purely for interest, for anyone who might like to hear a bit more about the work we do at Umbraco from a technical perspective.

The Umbraco Marketplace Solution

Included in the Marketplace solution is a scheduled process that synchronizes information from appropriately tagged packages hosted at NuGet.  This is augmented by additional information provided by package developers via a JSON file hosted on their project websites.

We load that information into a SQL database and expose it via an API.  We then have a single page application (SPA) based website that consumes the API and displays information about the Umbraco packages to the website visitors.

The existing combination of reading from the SQL database and using response caching works well, but we think there are some improvements we can make by using more appropriate infrastructure.

Specifically, we are looking at incorporating two further components into our Azure-based cloud infrastructure:

  • Utilizing a Redis-based distributed cache for requests where we are retrieving the full details of a single package by ID.
  • Introducing Azure Cognitive Search and using that to support requests where we are retrieving a collection of packages based on a query defining various filter and sort options.

Using Redis as a Distributed Cache for the Umbraco Marketplace

A common approach to avoid going to the database to support every request to an API is to introduce some caching, and the easiest way to do this with .NET is via response caching (previously known as output caching).  This is applied via attributes on controller endpoints, where, amongst other things, you can specify a time period for which the response to a given request is cached and returned without querying the database again for the results.
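
For instance, an endpoint can be decorated along the following lines (a minimal sketch, with an illustrative route, duration and payload rather than the actual Marketplace code):

```csharp
using System;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/packages")]
public class PackagesController : ControllerBase
{
    // Cache the response for five minutes.  With app.UseResponseCaching() in the
    // request pipeline, the response is also served from server memory for that
    // period without re-querying the database.
    [HttpGet("{id:guid}")]
    [ResponseCache(Duration = 300)]
    public IActionResult GetById(Guid id)
        => Ok(new { id, retrievedAt = DateTime.UtcNow });
}
```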

Existing Response Caching

This is the approach we had been using for the Umbraco Marketplace since its original release.  It’s worked well, and once the cache is populated it’s extremely fast with the response stored in memory.  However, it does have a few downsides:

  • If load balancing, the cache is stored in more than one place (in the memory of each web server) and has a chance of not being the same.  It’s possible a visitor with requests served from different nodes could see different results.
  • With the cache stored in memory, it’s not particularly durable.  A deployment, or other application restart, would clear the cache.  This means first requests are going to be slower as they will need to go to the database to serve the response.
  • It’s not straightforward to invalidate the cache, and so depending on your application, it may not be easy to find an optimum time span for keeping the data in the cache.

This last point was particularly relevant for us, given we were fully in control of populating the data.  Rather than having to select an arbitrary time for the cache - which may either lead to serving out-of-date data or unnecessarily retrieving unchanged data from the database - it would be better to cache for a long time and explicitly invalidate when we know there is a change.

Introducing the Distributed Cache

As an alternative to response caching, we can make use of the IDistributedCache interface, provided by Microsoft as part of .NET.  There are a few implementations available for this interface.

The first is an in-memory implementation, which is fast but, of course, has the same downsides as the response caching discussed earlier.  However, it's useful for local development, as you can work against the interface without needing any external component to hold the cache data.

There is also one for SQL Server, which has the benefit that the required infrastructure likely already exists in your web application, but the downside of still requiring a database query (albeit an efficient one) to retrieve the data.

Another, which is what we'll be using in production, is a Redis-backed implementation available via the Microsoft.Extensions.Caching.StackExchangeRedis NuGet package.

You can see from the following gist how we are using the appropriate implementation based on a configuration value:
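
A minimal sketch of that kind of conditional registration, assuming a boolean Caching:UseRedis setting and a Redis connection string (both names are illustrative), could look like this:

```csharp
// Program.cs - requires the Microsoft.Extensions.Caching.StackExchangeRedis NuGet package.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();

if (builder.Configuration.GetValue<bool>("Caching:UseRedis"))
{
    // Redis-backed distributed cache for the deployed environments.
    builder.Services.AddStackExchangeRedisCache(options =>
    {
        options.Configuration = builder.Configuration.GetConnectionString("Redis");
    });
}
else
{
    // In-memory implementation of IDistributedCache for local development.
    builder.Services.AddDistributedMemoryCache();
}

var app = builder.Build();
app.MapControllers();
app.Run();
```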

The advantage of using the Redis cache is that we’ll avoid the downsides of in-memory caching: the cached data will be held in one place for all web servers, it’ll be retained following deployments and we are able to prime and invalidate the cache.

Retrieving Data and Populating the Cache

To retrieve data from the cache we augment our controllers to pass in the registered IDistributedCache implementation via the constructor.  We also remove the attribute that was previously there for response caching.

We then follow a typical caching pattern where we:

  • Generate a key for the cache based on the parameters of the query.
  • Look in the cache for the value matching that key, and, if found, return it.
  • If no value is found for the key, we have a “cache miss” and so go to the source, in this case, the database via a service layer, to retrieve the value.
  • We then populate the cache with the value using the cache key, such that on a subsequent request we’ll get a “cache hit” and can service the request in a more performant way. 

The following illustrates this where we retrieve the details for a single category:
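
A sketch of that pattern for a "category by ID" endpoint is shown below.  The controller, ICategoryService, CategoryDto and the key format are illustrative stand-ins rather than the actual Marketplace code, and GetJsonAsync/SetJsonAsync are the typed extension methods covered just after this:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Caching.Distributed;

// Hypothetical DTO and service layer, standing in for the real Marketplace types.
public record CategoryDto(Guid Id, string Name);

public interface ICategoryService
{
    Task<CategoryDto?> GetByIdAsync(Guid id);
}

[ApiController]
[Route("api/categories")]
public class CategoriesController : ControllerBase
{
    private readonly IDistributedCache _cache;
    private readonly ICategoryService _categoryService;

    public CategoriesController(IDistributedCache cache, ICategoryService categoryService)
    {
        _cache = cache;
        _categoryService = categoryService;
    }

    [HttpGet("{id:guid}")]
    public async Task<IActionResult> GetById(Guid id)
    {
        // 1. Generate a key for the cache based on the parameters of the query.
        var cacheKey = $"CategoryById_{id}";

        // 2. Cache hit: return the cached value without touching the database.
        var cached = await _cache.GetJsonAsync<CategoryDto>(cacheKey);
        if (cached is not null)
        {
            return Ok(cached);
        }

        // 3. Cache miss: go to the source via the service layer.
        var category = await _categoryService.GetByIdAsync(id);
        if (category is null)
        {
            return NotFound();
        }

        // 4. Populate the cache so a subsequent request gets a cache hit.
        await _cache.SetJsonAsync(cacheKey, category);

        return Ok(category);
    }
}
```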

The IDistributedCache interface works only with strings and byte arrays, so it's necessary to take control of the serialization and deserialization into strongly typed objects yourself, which we have done with the following extension methods:
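
A minimal sketch of what such extension methods can look like, using System.Text.Json (the method names here are illustrative, chosen to avoid clashing with the built-in GetAsync/SetAsync overloads):

```csharp
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public static class DistributedCacheJsonExtensions
{
    // Reads a string from the cache and deserializes it, returning default
    // (null for reference types) on a cache miss.
    public static async Task<T?> GetJsonAsync<T>(
        this IDistributedCache cache, string key, CancellationToken token = default)
    {
        var json = await cache.GetStringAsync(key, token);
        return json is null ? default : JsonSerializer.Deserialize<T>(json);
    }

    // Serializes the value and writes it to the cache.  If no options are
    // provided, no expiry is set and the entry remains until we remove or
    // overwrite it ourselves.
    public static Task SetJsonAsync<T>(
        this IDistributedCache cache,
        string key,
        T value,
        DistributedCacheEntryOptions? options = null,
        CancellationToken token = default)
    {
        var json = JsonSerializer.Serialize(value);
        return cache.SetStringAsync(key, json, options ?? new DistributedCacheEntryOptions(), token);
    }
}
```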

Note that when setting the cache value we have the option to set an expiry, such that the value will be evicted from the cache after a set period.  In the case of caching the package representation we aren't doing that; instead we use no expiry, and we'll take control of removing or updating the cached value by other means when the underlying data changes.

Priming and Invalidating the Cache

With our application, we know when the information about a package is updated, as it's under our control via the scheduled process that synchronizes the information from NuGet and other sources.  We added a step to the end of that process that creates or updates the cached value.

With that, in theory, once the cache is populated, we should never get a cache miss for the requests that arrive at the API controllers.  We haven’t set an expiry, so the values aren’t evicted, and we will be updating the cached value when changes are detected.

In practice, it wasn’t quite that simple.  The reason was that the result of a query for a "package by ID" doesn’t only depend on the ID.  We have a couple of other parameters we can provide to this query that will modify the output.  For example, you can provide a current Umbraco version number, which will customize the result by indicating which version of the package you should install.  As such, the cache key for a package actually depends on the package ID, plus these additional parameters.

This means that for a given package, we may have more than one cached value.  Populating all of them as part of the package data synchronisation process wouldn’t be particularly efficient – there are many possible combinations of these parameters, and it’s likely only a few will be requested and thus useful to cache.
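
As an illustration, a cache key built from the package ID plus an optional Umbraco version could be composed like this (the exact format is a guess, based on the prefix mentioned further below):

```csharp
using System;

// Illustrative key composition: the base "package by ID" key, with any
// additional parameters appended when they are supplied.
public static class PackageCacheKeys
{
    public static string PackageById(Guid packageId, string? umbracoVersion = null)
        => umbracoVersion is null
            ? $"PackageById_{packageId}"
            : $"PackageById_{packageId}_{umbracoVersion}";
}
```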

We decided therefore that, when updating the information about a package, we should:

  • Update the cache value for the most commonly requested query (the one that provides just the package ID without any additional parameters).
  • Remove the values for any other variations that have been stored in the cache.
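
Put together, the step at the end of the synchronization process could look something like the sketch below.  The class and service names and the exact key format are assumptions for illustration; SetJsonAsync is the extension method shown earlier and RemoveByPrefixAsync is the Redis-specific helper covered next:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Hypothetical DTO and service layer, standing in for the real Marketplace types.
public record PackageDto(Guid Id, string Name);

public interface IPackageService
{
    Task<PackageDto?> GetByIdAsync(Guid id);
}

public class PackageCacheRefresher
{
    private readonly IDistributedCache _cache;
    private readonly IPackageService _packageService;

    public PackageCacheRefresher(IDistributedCache cache, IPackageService packageService)
    {
        _cache = cache;
        _packageService = packageService;
    }

    // Called at the end of the synchronization process for each updated package.
    public async Task RefreshAsync(Guid packageId)
    {
        var package = await _packageService.GetByIdAsync(packageId);
        if (package is null)
        {
            return;
        }

        // Prime the most commonly requested variation: package by ID, no extra parameters.
        await _cache.SetJsonAsync($"PackageById_{packageId}", package);

        // Remove any cached variations that include additional parameters,
        // e.g. PackageById_{packageId}_{umbracoVersion}.
        await _cache.RemoveByPrefixAsync($"PackageById_{packageId}_");
    }
}
```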

Carrying out the latter was a little tricky, as the IDistributedCache interface is quite limited, being concerned only with requesting, setting or removing a single value by key.  Here we want to remove multiple keys that we can identify as starting with a particular string – e.g. PackageById_{packageId}_.

To do that we had to use the following code, found by digging into this tracked issue.  Having determined that we are using the Redis implementation of IDistributedCache, it uses reflection to obtain access to the underlying Redis ConnectionMultiplexer.  With that, we can drop down to Redis-specific commands that allow removing values for keys by prefix.
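
A sketch of that approach is shown below.  Note that the private _connection field is an internal implementation detail of the Microsoft.Extensions.Caching.StackExchangeRedis package and its name may change between versions, so treat this as illustrative rather than definitive:

```csharp
using System.Reflection;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.StackExchangeRedis;
using StackExchange.Redis;

public static class DistributedCachePrefixExtensions
{
    // Removes all cached values whose keys start with the given prefix.
    // Only the Redis implementation is handled; for other implementations
    // (e.g. the in-memory cache used locally) this is a no-op.
    public static async Task RemoveByPrefixAsync(this IDistributedCache cache, string prefix)
    {
        if (cache is not RedisCache redisCache)
        {
            return;
        }

        // The connection is created lazily, so perform a read first to make
        // sure it has been established before we reflect on it.
        await cache.GetAsync(prefix);

        // Reflect into the private field holding the connection multiplexer.
        // "_connection" is an internal implementation detail and could change
        // in a future version of the package.
        var connectionField = typeof(RedisCache).GetField(
            "_connection", BindingFlags.Instance | BindingFlags.NonPublic);
        if (connectionField?.GetValue(redisCache) is not IConnectionMultiplexer connection)
        {
            return;
        }

        var database = connection.GetDatabase();

        // Scan each server for keys matching the prefix and delete them.
        foreach (var endPoint in connection.GetEndPoints())
        {
            var server = connection.GetServer(endPoint);
            await foreach (var key in server.KeysAsync(pattern: $"{prefix}*"))
            {
                await database.KeyDeleteAsync(key);
            }
        }
    }
}
```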

Conclusions

With the implementation now deployed to production, we've been happy with the results.  Although moving the cache from in-memory to an external service likely introduces a little additional latency, the impact is very small thanks to the efficiency of Redis at storing and retrieving values by key.

More importantly, we've made the performance much more consistent: we can now make deployments without losing our cached data and degrading the initial experience for visitors.

In the next post, we’ll take a look at how we used Azure Cognitive Search to further improve querying and searching on the Marketplace.
