Populating and Querying Azure Cognitive Search in the Umbraco Marketplace

Written by Andy Butland

This post is part of a short series focussing on the introduction of new components to improve performance for the read-only API supporting the Umbraco Marketplace. These technologies were mostly new to us, so across the series we share our findings and how we used them. In this post, we’ll take a look at how we improved querying and searching using Azure Cognitive Search.

Working with Azure Cognitive Search

An Azure Cognitive Search instance can be created on Azure via the portal, through the management APIs, or with an infrastructure-as-code tool such as Terraform. Various SKUs are available, including a free one, which is suitable for development work.

To work with the search index in code, you’ll need to add a package reference to Azure.Search.Documents and then add the following values to your application’s configuration. Most of them can be read from the Azure Portal.

Service endpoint (the URL available from the "Overview" tab)
Admin API key (available under "Keys")
Query API key (available under "Keys")
Index Name (you'll likely create this in code, so can be whatever name makes sense for your application)
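As a minimal sketch of how these four values might be held and used, assume a hypothetical SearchSettings class (the class and factory names here are illustrative, not taken from the Marketplace code). The admin key is used for the client that manages the index, and the query key for the client that only searches:

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;

// Hypothetical settings class holding the four configuration values listed above.
public sealed class SearchSettings
{
    public string ServiceEndpoint { get; set; } = string.Empty;
    public string AdminApiKey { get; set; } = string.Empty;
    public string QueryApiKey { get; set; } = string.Empty;
    public string IndexName { get; set; } = string.Empty;
}

public static class SearchClientFactory
{
    // The admin key is required to create or alter the index itself.
    public static SearchIndexClient CreateIndexClient(SearchSettings settings) =>
        new SearchIndexClient(
            new Uri(settings.ServiceEndpoint),
            new AzureKeyCredential(settings.AdminApiKey));

    // The query key is read-only, which is sufficient for running searches.
    public static SearchClient CreateQueryClient(SearchSettings settings) =>
        new SearchClient(
            new Uri(settings.ServiceEndpoint),
            settings.IndexName,
            new AzureKeyCredential(settings.QueryApiKey));
}
```

Writing documents to the index needs a SearchClient created with the admin key rather than the query key.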

Preparing the Search Schema

Whilst you can use the Azure portal to create an index and populate it with fields, it’s likely easier and more maintainable to define the index in code. It's worth noting first though that there is a schema – working with the search index isn’t as free-form as with some document databases. You not only have to define what fields you are storing, but for each field, you also need to make decisions on how it works within the index.

We found that the search service wasn’t very accepting of changes to existing fields: it seemed that any update to the schema, other than adding new fields, would make it necessary to re-index. For the amount of information we are storing this isn’t a big deal, but if you had to index a lot of content it could be. In any case, it’s certainly worth working with a small set of data for a while, whilst you finalize the schema, before populating the index with a vast amount of data.

To adopt a code-first approach to the schema, you can create a set of classes that represent the “document” you are looking to store. This doesn’t need to be a simple set of primitive fields; it’s also possible to define and store complex types (such as in our case the related information about “author” and “category” as well as the main “package” document).

Each field should be decorated with an attribute that determines how it can be accessed in the index:

SearchableField is used for fields that take part in the text search. Any field that stores string data, and whose content should be able to bring a document into the results, should have this attribute.
SimpleField is used for fields that either don’t contain string data or don’t contain relevant text for the search results. They can however be returned as part of the results.

Each attribute also has various properties that customize the field, allowing them to be included in filters or facets. And one field needs to be defined as the key, used to uniquely identify documents.

Our API was already returning a strongly typed collection of PackageDto objects, so in theory, we could have looked to decorate and index that as our document in the index. However, partly as this would mix concerns, and partly as we had some subtle differences in which fields to include and how they should be stored, we decided to create a new object, named PackageModel. With mapping to and from our PackageDto, it looked something like this (most properties removed for clarity):
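As a rough illustration of such a decorated class (not the Marketplace’s actual model), the sketch below uses assumed property names. It shows a key, searchable text fields, filterable and sortable simple fields, a collection, and a nested complex type:

```csharp
using System;
using Azure.Search.Documents.Indexes;

// Illustrative document model; all property names are assumptions for this sketch.
public class PackageModel
{
    // Exactly one field must be the key that uniquely identifies a document.
    [SimpleField(IsKey = true)]
    public string Id { get; set; } = string.Empty;

    // Searchable text that can also be used for ordering.
    [SearchableField(IsSortable = true)]
    public string Title { get; set; } = string.Empty;

    [SearchableField]
    public string Description { get; set; } = string.Empty;

    // Not part of the text search, but usable in filters and facets.
    [SimpleField(IsFilterable = true, IsFacetable = true)]
    public string PackageType { get; set; } = string.Empty;

    [SimpleField(IsFilterable = true)]
    public string LicenseType { get; set; } = string.Empty;

    // A collection of primitive values, filterable with OData any().
    [SimpleField(IsFilterable = true)]
    public int[] UmbracoMajorVersionsSupported { get; set; } = Array.Empty<int>();

    [SimpleField(IsSortable = true)]
    public int NumberOfNuGetDownloads { get; set; }

    // A nested class is stored as a complex type.
    public AuthorModel Author { get; set; } = new AuthorModel();
}

public class AuthorModel
{
    [SearchableField]
    public string Name { get; set; } = string.Empty;
}
```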

We can then use this decorated class to set up our index.

Populating the Search Index

We decided to include the work to set up the index as part of a method responsible for taking an existing package and using it to create and populate a document in that index.

As such the first step is to check if the index exists, and if not use the FieldBuilder class available to create the index with the necessary schema:
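A minimal sketch of that check-then-create step follows, assuming a trimmed-down document class and hypothetical helper names. FieldBuilder derives the field definitions from the attributes, and a 404 response from GetIndexAsync is treated as "index not found":

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Trimmed-down document class, for illustration only.
public class PackageModel
{
    [SimpleField(IsKey = true)]
    public string Id { get; set; } = string.Empty;

    [SearchableField]
    public string Title { get; set; } = string.Empty;
}

public static class IndexSetup
{
    // Builds the index definition from the decorated class; no network call is made here.
    public static SearchIndex BuildIndexDefinition(string indexName)
    {
        var fields = new FieldBuilder().Build(typeof(PackageModel));
        return new SearchIndex(indexName, fields);
    }

    // Creates the index only if the service reports that it does not exist yet.
    public static async Task EnsureIndexExistsAsync(SearchIndexClient indexClient, string indexName)
    {
        try
        {
            await indexClient.GetIndexAsync(indexName);
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            await indexClient.CreateIndexAsync(BuildIndexDefinition(indexName));
        }
    }
}
```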

When doing a text search, it's often the case that some fields are more important than others when it comes to prioritizing the results. In our case, for example, a match on the package title should be ranked higher than one made in the description.

This type of information is encoded into the schema via a scoring profile. We list the fields that should influence ranking and give each one a weight, which acts as a multiplier indicating the importance of a match on that field:
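For illustration, a scoring profile with text weights could be attached to the index definition along these lines. The profile name, field names, and weight values are assumptions made for this sketch, and the weighted fields need to be searchable:

```csharp
using System.Collections.Generic;
using Azure.Search.Documents.Indexes.Models;

public static class ScoringSetup
{
    // Hypothetical profile name used for this sketch.
    public const string ProfileName = "packageWeights";

    // Adds a profile in which a Title match counts five times as much as a Description match.
    public static void AddTextWeights(SearchIndex index)
    {
        var profile = new ScoringProfile(ProfileName)
        {
            TextWeights = new TextWeights(new Dictionary<string, double>
            {
                ["Title"] = 5.0,
                ["Description"] = 1.0,
            }),
        };

        index.ScoringProfiles.Add(profile);

        // Making it the default means queries need not name the profile explicitly.
        index.DefaultScoringProfile = ProfileName;
    }
}
```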

With the index prepared we can then index the document. Index population is done as part of a batch, so even though we are only indexing a single document in this method, we need to wrap the individual operation within a batch.

Each operation can be of different types: upload (add), merge (update), upload or merge (add or update), or delete. As you can see here, we are using the “upload or merge” option such that we create a document if it doesn’t exist, and update it by key if it does:
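A minimal sketch of wrapping a single “upload or merge” action in a batch is shown here, using a trimmed-down document class and hypothetical helper names. The SearchClient passed in is assumed to have been created with the admin key:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Trimmed-down document class; Id is assumed to be the key field in the index.
public class PackageModel
{
    public string Id { get; set; } = string.Empty;
    public string Title { get; set; } = string.Empty;
}

public static class PackageIndexer
{
    // Wraps one merge-or-upload action in a batch; no network call is made here.
    public static IndexDocumentsBatch<PackageModel> CreateBatch(PackageModel package) =>
        IndexDocumentsBatch.Create(IndexDocumentsAction.MergeOrUpload(package));

    // Sends the batch to the index; the response holds one result per action.
    public static Task<Response<IndexDocumentsResult>> IndexAsync(SearchClient client, PackageModel package) =>
        client.IndexDocumentsAsync(CreateBatch(package));
}
```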

The results of the operation are received as a collection which indicates the detail of each operation. As such if you are indexing multiple documents, you can find the status for each one.
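As a sketch of reading those per-document results, the hypothetical helper below collects a message for each operation that did not succeed. By default the client reports partial failures in the result rather than throwing, unless IndexDocumentsOptions.ThrowOnAnyError is set:

```csharp
using System.Collections.Generic;
using Azure.Search.Documents.Models;

public static class IndexingResultReader
{
    // Returns one message per failed operation, identified by document key.
    public static List<string> GetFailures(IndexDocumentsResult result)
    {
        var failures = new List<string>();
        foreach (IndexingResult item in result.Results)
        {
            if (!item.Succeeded)
            {
                failures.Add($"{item.Key}: {item.ErrorMessage} (status {item.Status})");
            }
        }
        return failures;
    }
}
```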

Querying the Search Index

To query the index, we create a SearchClient and pass two parameters to its Search (or SearchAsync) method:

The text we are searching for.
An options object that defines the fields we want to bring back, plus any filters, paging, and sorting we want to apply to the result set.

Firstly, for selecting the fields, we indicate the specific ones we want to include in the results. In our case the fields to be returned are defined on the query specification object passed into the method:
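A sketch of populating the selected fields is given below. The PackageQuerySpecification class and its FieldsToReturn property are hypothetical stand-ins for the query specification object mentioned above:

```csharp
using System.Collections.Generic;
using Azure.Search.Documents;

// Hypothetical query specification; only the part relevant to field selection is shown.
public sealed class PackageQuerySpecification
{
    public IList<string> FieldsToReturn { get; set; } = new List<string>();
}

public static class SelectExample
{
    // Copies the requested field names into the search options.
    public static SearchOptions WithSelectedFields(PackageQuerySpecification specification)
    {
        var options = new SearchOptions();
        foreach (string fieldName in specification.FieldsToReturn)
        {
            options.Select.Add(fieldName);
        }
        return options;
    }
}
```

If no fields are selected, the service returns every field that is retrievable.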

Filtering is carried out using the expressive OData syntax, of which there are a few examples below. Amongst others, we look for the exact match for a given string value (package type), see if a provided value matches one of a collection (license type), and do a more complex check for whether a package supports a given Umbraco version.
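The three kinds of filter described above might be built along the following lines. The field names are assumptions for this sketch, and SearchFilter.Create is used so that interpolated values are quoted and escaped:

```csharp
using System.Collections.Generic;
using Azure.Search.Documents;

public static class PackageFilters
{
    // Exact match on a single string value.
    public static string ByPackageType(string packageType) =>
        SearchFilter.Create($"PackageType eq {packageType}");

    // Matches if the field equals any one of the supplied values.
    public static string ByLicenseTypes(IEnumerable<string> licenseTypes) =>
        SearchFilter.Create($"search.in(LicenseType, {string.Join("|", licenseTypes)}, '|')");

    // Matches if the collection of supported versions contains the given one.
    public static string BySupportedVersion(int majorVersion) =>
        SearchFilter.Create($"UmbracoMajorVersionsSupported/any(v: v eq {majorVersion})");

    // Individual expressions are combined with the OData "and" operator.
    public static string Combine(params string[] filters) =>
        string.Join(" and ", filters);
}
```

The combined string is then assigned to the Filter property of the search options.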

Ordering just requires the field name and optionally a direction:
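As a small sketch with a hypothetical helper name, each ordering clause is simply a string added to the options. The field must be marked as sortable in the schema, and the direction is ascending when omitted:

```csharp
using Azure.Search.Documents;

public static class OrderingExample
{
    // Adds a single ordering clause such as "NumberOfNuGetDownloads desc".
    public static SearchOptions OrderedBy(string fieldName, bool descending = false)
    {
        var options = new SearchOptions();
        options.OrderBy.Add(descending ? $"{fieldName} desc" : fieldName);
        return options;
    }
}
```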

And finally, for paging, we define the skip-and-take options for our search. The results will include not only the result set but an indication of the total number of documents that matched the search.
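A sketch of translating a page number and page size into these options is shown below, assuming 1-based page numbers and a hypothetical helper name. Setting IncludeTotalCount asks the service to return the total number of matches alongside the page of results:

```csharp
using Azure.Search.Documents;

public static class PagingExample
{
    // pageNumber is 1-based: page 1 skips nothing, page 2 skips one page, and so on.
    public static SearchOptions ForPage(int pageNumber, int pageSize) =>
        new SearchOptions
        {
            Skip = (pageNumber - 1) * pageSize,
            Size = pageSize,
            IncludeTotalCount = true,
        };
}
```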

With the results of the search operation, we receive back instances of our document model, which we then map back into the DTO class to return to the calling code.
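Putting the query together, a minimal sketch of running the search and mapping each returned document back to a DTO might look like this. Both classes are trimmed down and the mapping is hand-written purely for illustration:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Trimmed-down document and DTO classes, for illustration only.
public class PackageModel
{
    public string Id { get; set; } = string.Empty;
    public string Title { get; set; } = string.Empty;
}

public class PackageDto
{
    public string Id { get; set; } = string.Empty;
    public string Title { get; set; } = string.Empty;
}

public static class PackageSearch
{
    // Maps the stored document back to the DTO returned to calling code.
    public static PackageDto ToDto(PackageModel model) =>
        new PackageDto { Id = model.Id, Title = model.Title };

    // Runs the search and returns the mapped page plus the total match count.
    public static async Task<(long Total, List<PackageDto> Items)> SearchAsync(
        SearchClient client, string searchText, SearchOptions options)
    {
        SearchResults<PackageModel> results =
            await client.SearchAsync<PackageModel>(searchText, options);

        var items = new List<PackageDto>();
        await foreach (SearchResult<PackageModel> result in results.GetResultsAsync())
        {
            items.Add(ToDto(result.Document));
        }

        // TotalCount is only populated when IncludeTotalCount was set on the options.
        return (results.TotalCount ?? items.Count, items);
    }
}
```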

Conclusions

With the search implementation in place, we are seeing some important improvements over the previous SQL Server-based solution. One is speed: querying the search index is faster than the equivalent operation on the database.

We also see an improvement in the quality of the results when using text searches.

SQL Server with full-text indexing provides a reasonable experience, but having a fully-fledged search engine powering the text search certainly improves it. With easier means of handling boosting and features such as stemming ensuring word variations are given correct credit, the results returned to the user are improved in their relevance.

Combined with the distributed caching described in the previous blog post, we've improved the user experience as well as the maintenance and further development of the Marketplace. We hope sharing the journey and implementation details can provide inspiration. And remember, if you run into any issues or have ideas for how to improve the Marketplace, you're welcome to reach out to us on the issue tracker.