Examining Examine

When Umbraco Examine was released many developers including myself thought all my search woes were solved. I figured I would be able to just fire up Umbraco, configure examine and BAM! I would have a search engine that gave me exactly what I needed. But after giving it an initial go and trying it out, most developers left it alone and put it in the too hard basket as they did not understand what was going on under the surface or why their search results were not what they expected. What I want to try and achieve in this post is to demystify Umbraco Examine and provide a few tips I have picked up while working with it.

What is Examine?

First, for those that don't know what I am on about let's just start by telling you what Examine is. Examine is a provider based Indexer/Searcher API and wraps the Lucene.Net indexing/searching engine. Examine makes it super easy to work with Lucene indexes and allows you to query or index almost any content with very little effort. Lucene is SUPER fast and when managed correctly can allow you to open up super fast searching on absurd amounts of data.

What is Umbraco Examine?

Umbraco Examine is the Umbraco implementation of Examine. Examine is not exclusive to Umbraco and can actually be used as a completely stand alone component on any project that needs a fast Index. Umbraco Examine is a combination of Umbraco, Examine and Lucene.Net and uses Umbraco as the data source for its Lucene index. Lucene has been a part of the Umbraco CMS from the early days and powers the back office search.

The Basics of Examine

Out of the box Umbraco comes configured to index content and members in the back office for the back office or internal search engine that appears in the header. As Examine is configuration driven you can quickly modify or set up indexes and searchers. The best place to start is with an index set.

You configure an index set in the /config/examineIndex.config file. This file defines our indexes. As mentioned earlier, there are two internal default indexes configured in this file, and they are called "InternalIndexSet" and "InternalMemberIndexSet". Index sets are extremely easy to set up. The quickest way to get your index set up and running is with a single line of code.

<IndexSet SetName="WebsiteIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/WebsiteIndexSet/">

Really, that is all you need. You can actually leave out a majority of the configuration and Examine will revert to the default which is to index all your data and all your properties. But if you want a little more control you can use the full configuration options as shown below.

<IndexSet SetName="WebsiteIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/WebsiteIndexSet/">
    <IndexAttributeFields>
        <add Name="id" />
        <add Name="nodeName" />       
        <add Name="updateDate" />
        <add Name="writerName" />
        <add Name="path" />
        <add Name="nodeTypeAlias" />
        <add Name="parentID" />
    </IndexAttributeFields>
    <IndexUserFields>
        <add Name="bodyText"/>
        <add Name="metaDescription"/>
        <add Name="metaTitle"/>
    </IndexUserFields>
    <IncludeNodeTypes>
        <add Name="TextPage"/>
    </IncludeNodeTypes>
    <ExcludeNodeTypes>
    </ExcludeNodeTypes>
</IndexSet>

Let's break it down.

SetName defines the name that we will use to reference this set in configuration and code later.

IndexPath defines where the actual index will reside. The standard root location in Umbraco is in App_Data/TEMP/ExamineIndexes folder

<IndexAttributeFields>
This is where we define which of the standard properties (not user defined properties) we wish to index.

<IndexUserFields>
These are the properties that we have defined against our DocumentTypes that we want to index. Use the property aliases.

<IncludeNodeTypes> & <ExcludeNodeTypes>
I like to think of these as the white list and the black list. Basically figure out if you have to include more or exclude more nodeTypes. Generally which ever one requires less configuration is the one that you should use but that is completely up to you and how much you like typing.

Just remember that creating errors in index configuration is easy if you make spelling mistakes. You will know if you have made errors as your index will either not create or not include the features you expect. We will look at how to check your index content later with Luke, a Java based tool that allows us to look into our index and run queries against the index directly.

So now we have defined our index, we actually now need to tell Examine what to do with this configuration. In our next step we need to configure our Indexer and our Searcher. This is done in the /config/ExamineSettings.Config file.

This file contains two main sections, <ExamineIndexProviders> and <ExamineSearchProviders>

Setting Up your IndexProvider

We will start out by creating our ExamineIndexProvider. As the name suggests this tells Examine how to manage our Index. Again, contrary to what you may have been told you can actually leave out most of the optional configuration and simply specify your indexer in a single line of code.

<add name="WebsiteIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

But if we want more control we can define our Provider by creating the configuration as follows:

<add name="WebsiteIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" 
    dataService="UmbracoExamine.DataServices.UmbracoDataService, UmbracoExamine" 
    supportUnpublished="false" 
    supportProtected="false" 
    analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" 
    enableDefaultEventHandler="true"/>

Let's look at each of the options.

name: This is the alias that we are going to give our indexer. You may notice above that we have given it the name WebsiteIndexer. If maintaining the naming conventions (Suffix of Indexer, Searcher, IndexSet) Examine will be able to wire up the parts easily and we can omit certain configuration elements. If however we decided that we wanted to call our indexer something else we would need to specify the indexSet attribute so that our indexer knows what it is wired up to.

type: Because we are indexing Umbraco Content we need to specify the type as being UmbracoExamine.UmbracoContentIndexer.

dataService (optional): The type that this provider will instantiate in order to query Umbraco for the data that it requires. Generally this shouldn't need to change unless you want to use test data from a non-Umbraco source or you have very custom requirements.

indexSet (optional): Explicitly specifies the index set to use. Generally this is wired up based on naming conventions.

supportUnpublished (optional): As the name implies specifies whether or not you want to index unpublished content.

supportProtected (optional): Set this to true if you want it to index your protected content.

analyzer (optional): The Lucene.Net analyzer to use when storing data. There are a number of different analyzers available and they all do slightly different things. If you want a good overview of the different analyzers check out https://www.aaron-powell.com/lucene-analyzer

enableDefaultEventHandler (optional):Will automatically listen for Umbraco events and index when required. If you are relying on the standard Umbraco events, setting this False could be a simple way of disabling indexing if required.

Now that Examine knows how to index our site. It should be reading the config and slamming your content into an index.

Checking your Index

The easiest way to check that your config is correct is to first go and look in the IndexPath that you specified in your IndexSet. You should see a set of folders and files. If you want to check the contents of the index you can do this using a tool named Luke which you can either download or launch from this page http://www.getopt.org/luke/.

Using Luke, you can open up your index and see its content. It should give you statistics on the elements, number of documents indexed etc. If however you don't see indexed items then it's likely you have either made a mistake in your indexSet configuration or it has not indexed yet. When you are finished with Luke it's important to close the index to get Luke to release its lock on the files.

Luke can also be used to run raw Lucene searches against your index. This can be helpful during development or to debug your queries. Later on in this document you will start to learn how these raw queries fit together.
Setting up your Search Provider
We have a provider for our indexing, now we need to set up our search provider. We do this by adding a new provider in the <ExamineSearchProviders> section of the examinesettings.config file. The simplest search provider element looks like this:

<add name="WebsiteSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

Let's break it down, again.

name: This is the alias for our search provider. Again if we stick to naming conventions we can omit the indexSet attribute as it is able to automatically determine this from its own name.

type: Because we are searching Umbraco content we specify the type as UmbracoExamine.UmbracoExamineSearcher as the type of search provider.

analyzer (optional): Just like the indexer our search provider also needs to know what type of analyzer to use when searching the index. Make sure that your search is using the same analyzer as your indexer. Think of it as the language the indexer writes, the searcher needs to be able to read, just like if you write in English don't expect to be able to read French.

Awesome! Our configuration is now complete! Our site should now be indexing our content and has a way of accessing it via a search provider. However, configuration is only part of the story... we now need to do some coding.

Querying with Examine

So without further ado let's cut straight to the chase and focus on querying your indexes. We are going to look first at the Fluent API which allows you to quickly create queries using an easy to learn chaining syntax, and then at building your own Raw Lucene queries for more complex scenarios.

Fluent API

Examine contains a powerful fluent API that enables you to create searches by chaining up query conditions and then pass that chain to Examine to return results. Let's look at formulating a query.
First we need to specify our search provider that we want to do the searching for us.

var Searcher = ExamineManager.Instance.SearchProviderCollection["WebsiteSearcher"];

The alias "WebsiteSearcher" comes from the searcher that we have configured examinesettings.config file.

Next we want to create an instance of ISearchCriteria that we will use to build our search query.

var searchCriteria = Searcher.CreateSearchCriteria();

Next we are going to chain up query. Lets look at one of the most common tasks you would be using examine for, which is to search a set of nodes for a particular word. We have a number of operators available to us to do this. These are Or(), And(), Not() and Equals(). Lets look at the Or() operator.

var query = searchCriteria.Field("nodeName","hello").Or().Field("metaTitle","hello").Compile();
var searchResults = Searcher.Search(query);
yourRepeater.DataSource = searchResults;
yourRepeater.DataBind();

Now this is where things often go wrong for developers and why they often get a little frustrated. What they assume is that this query is going to return them results where either nodeName or metaTitle contain hello. But what actually happens with this query is that by default Examine reads this query as nodeName MUST contain hello or metaTitle SHOULD contain hello. So you will often get odd results or none at all.

So Examine feeds the query to Lucene as follows.

+nodeName:hello metaTitle:hello

The + specifies that it MUST meet this rule.

Fortunately there is an easy way to fix this behaviour, by specifying the default operator for our search criteria to be BooleanOperation.Or. The simplest way of explaining what this does is that without it the default operation that Examine uses in Lucene terms is MUST (or AND in Examine terms). Because of this default behaviour, it means that if you are unaware of it you will probably see some strange results. By passing in the default operator of OR (in Lucene terms SHOULD) you stop the default AND operation. To fix this change the creation of the instance of the ISearchCriteria to pass in a BooleanOperation.Or like this:

var searchCriteria = Searcher.CreateSearchCriteria(BooleanOperation.Or);

Now when Examine passes the query to Lucene it will pass as this:

nodeName:hello metaTitle:hello

Which in simple terms means give me anything where nodeName or metaTitle contain hello. Much better.

So with this knowledge let's look at another example using the And() operator.

var query = searchCriteria.Fields("nodeName","hello").And().Field("metaTitle",hello").Compile();

without passing the BooleanOperation.Or into our ISearchCriteria we would get the following for our Lucene query

+nodeName:hello +metaTitle:hello

Which means give me results where nodeName MUST contain hello AND metaTitle MUST contain hello.

With the BooleanOperation.Or we would get this:

nodeName:hello +metaTitle:hello

Which means give me results where nodeName SHOULD contain hello AND metaTitle MUST contain hello.

What if we want grouping? Well with Examine you get GroupedAnd() and GroupedOr() which as you would expect mean we can group up fields to look in and pass in a query term. I'm going to assume that I am using the BooleanOperation.Or from now on so will only show the result that it gives using that default operator.

var query = searchCriteria.GroupedOr(new string[] { "nodeName", "metaTitle"}, "hello").Compile();

This would give a Lucene query that looks like this:

(nodeName:hello metaTitle:hello)

You can also pass in a group of query value.

var query = searchCriteria.GroupedOr(new string[] { "nodeName", "metaTitle"}, new string[]{"hello", "goodbye"}).Compile();

this would end up being:

(nodeName:hello metaTitle:goodbye)

I think you get the gist.

But wait, there's more! What if you want to start combining operators? You can! It's not called chaining for nothing...

var query = searchCriteria.Field("nodeName","hello").And().GroupedOr(new string[] { "metaTitle", "metaDescription"}, new string[]{"hello", "goodbye"}).Compile();

This would end up being:

nodeName:hello +(metaTitle:hello metaDescription:goodbye)

Fuzzy
Sometimes users will query your site looking for a term that they could have misspelled or is very close. Fuzzy gives you the ability to get Lucene to look for terms that look like your term. Eg mound could actually be sound.

var query = searchCriteria.Fields("nodeName","hello".Fuzzy(0.8)).Compile();

The optional value you pass into Fuzzy between 0 and 1 specifies how Fuzzy or how close the match is to the original. For instance a match of 0.5 will not return when a threshold of 0.8 is specified.

Boosting
Sometime you want to give a field more relevance than others. Thankfully we can use the concept of Boosting. What this does is give a particular query a higher relevance. That means that you can for instance say that if the nodeName contains the query then Boost it because its more relevant than those that only contain the term in the body.

var query = searchCriteria.Fields("nodeName","hello".Boost(8)).Or().Field("metaTitle","hello".Boost(5)).Compile();

If you're curious as to the Math of this, don't be. First rule of Lucene is never ask how the Math works.

Power Searching with Raw Lucene Queries
Cool, we have learnt a lot. BUT the Fluent API can't do everything. Let's look at a complex example that is not achievable with the Fluent API. Let's say we want to do a search with a phrase, boost exact matches, find results that contain any of the parts of our search phrase. The Fluent API will not be able to do this. So this is where you need to know a little bit of Lucene query syntax. Along the way I have been showing you what certain queries look like when we convert them from the fluent API to Lucene syntax. It's not as complex as it looks.

Let's say our query is for the following phrase: "paging in XSLT". Most times when you search with Examine you will unfortunately either get nothing back or get back only a few results that match exactly. To see why, let's look at how we might do this with Fluent and then look at the problem when it's converted:

var query = searchCriteria.Fields("nodeName","paging in XSLT").Compile();

When this converts it looks like this:

nodeName:"paging in XSLT"

The problem becomes apparent straight away as you can see it quotes the phrase which basically means the term is the whole phrase. What we want from Lucene is the following:

nodeName:paging in XSLT

What this basically means is that field nodeName needs to contain either paging, in, or XLST.

So to achieve this we need to build our custom Lucene query and then pass it to Examine as a Raw query.

var term = Request["q"];

var luceneString = "nodeName:" + term;
var query = searchCriteria.RawQuery(luceneString);

Cool, now our search is starting to perform more like how we want it to. Now what about making it boost matches containing all the terms higher? No worries.

var luceneString = "nodeName:";
luceneString += "(+" + term.Replace(" ", " +") + ")^5 ";
luceneString += "nodeName:" + term;

The resultant query would look like this:

nodeName:(+paging +in +XSLT)^5 nodeName:paging in XSLT

The ^5 is the equivalent of .Boost(5)

So as you can see, it can get pretty complex but also pretty powerful. For more information I really suggest that you take the time to look through the official Lucene query syntax guide as it will show you what can be used to achieve very complicated queries.

We haven't gone into the output of results but quickly here is what you could end up with in your codebehind using the above Raw query:

var Searcher = ExamineManager.Instance.SearchProviderCollection["OurIndexSearcher"];
var searchCriteria = Searcher.CreateSearchCriteria(BooleanOperation.Or);
var term = Request["q"];

var luceneString = "nodeName:";
luceneString += "(+" + term.Replace(" ", " +") + ")^5 ";
luceneString += "nodeName:" + term;

var query = searchCriteria.RawQuery(luceneString);
var searchResults = Searcher.Search(query).OrderByDescending(x => x.Score);
yourRepeater.DataSource = searchResults;

yourRepeater.DataBind();

And then in your code front the objects that you are iterating over are of type Examine.SearchResult. So to databind on those you would have syntax like the following:

<%# ((Examine.SearchResult)Container.DataItem).Fields["nodeName"] %>

In conclusion Examine is awesome and with a little bit of information and a little patience to get to know it, you can be up and running with some very complex querying.