Minimum Viable Product - Part 7 - Searching Content

Visitors need to be able to find your content

As we move along in this series we've come to a very important piece — the ability for users to search for content on your site. While we could dump out every single article on directory pages neatly organized by month and year to allow our visitors to scan a wall of links, let's accept we're not barbaric nor is this 1994.

A naive vs modern approach

I remember the first time I wrote a rudimentary site search by writing some SQL code that looked like this:

SELECT *
FROM myArticles
WHERE 
    articleName LIKE '%foo%'
    OR articleBody LIKE '%foo%'
ORDER BY articleDate DESC

Back in the day, that code was 'good enough' for most simple searches as long as you had a good grasp of the schema, the fields to search didn't change, there were no misspellings in the keywords and you didn't care about relevancy scores. Fast forward to the present day and searching is much more sophisticated. Fortunately for us, Umbraco uses just such a sophisticated search framework: Lucene, an Apache Software Foundation project. Lucene is a flexible, database-free, file-system-based indexer written in Java. Don't worry, we won't be getting out our Java IDE for searching; Lucene has been ported to .NET in a library called Lucene.Net. And to make it even easier to work with Umbraco, Shannon Deminick wrapped all the Lucene.Net goodness into another package called Examine, which makes adding separate searchers within Umbraco super easy. My advice for anyone using the Google Search Appliance is to switch to Examine.

For the purposes of this article we'll just use 'Examine' when talking about all of this search awesomeness. However, if you need to google something about the query syntax, the Lucene syntax is the same whichever language you call it from.
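To make 'fuzzy' a little more concrete: Lucene's fuzzy searches are based on edit distance (Levenshtein distance), the number of single-character changes separating two words. The sketch below is an illustration in plain C#, not Lucene's actual implementation. The `FuzzyDemo` class and its `Similarity` formula are hypothetical names made up for this example; they only show why a misspelled 'umbrako' can still be considered close to 'umbraco' where a SQL LIKE would come up empty.

```csharp
using System;

public static class FuzzyDemo
{
    // Classic dynamic-programming Levenshtein distance: the number of single-character
    // insertions, deletions or substitutions needed to turn a into b.
    public static int EditDistance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Turning an empty string into the first j characters of b takes j insertions.
        for (var j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (var i = 1; i <= a.Length; i++)
        {
            current[0] = i;

            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                var insertOrDelete = Math.Min(current[j - 1] + 1, previous[j] + 1);
                current[j] = Math.Min(insertOrDelete, previous[j - 1] + cost);
            }

            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Illustrative similarity between 0 and 1 (1 means identical).
    // This is an assumption for the demo, not the formula Lucene uses internally.
    public static double Similarity(string a, string b)
    {
        var longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)EditDistance(a, b) / longest;
    }

    public static void Main()
    {
        Console.WriteLine(EditDistance("umbraco", "umbrako"));
        Console.WriteLine(Similarity("umbraco", "umbrako"));
    }
}
```

A single substituted letter gives an edit distance of 1, so the two spellings score as very similar. A threshold such as 0.5 then decides how much difference is still tolerated.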

Examine pieces and parts

Out of the box, Examine is configured with an internal and an external searcher, and those are paired respectively with an internal and an external indexer. The searchers search the indexes. The indexes are built (or rebuilt) based on different events. On the first load of Umbraco, if either the internal or external index doesn't exist, an asynchronous worker builds them as files located in your local '~/App_Data/TEMP/ExamineIndexes' folder. Having personally worked only with SQL before I learned about Examine, it kind of shocked me that Umbraco searches happen this way. Every time a node is saved or published, these indexes need to be updated. Rather than rebuild the whole shebang, Examine updates the indexes with only the changed node data.

If you go into the 'Developer' section of the backoffice, you'll be greeted with a tab that reads 'Examine Management'. Here you can rebuild an index on demand or visualize what is in a particular index. These tools will be helpful when testing out our custom search later.

One thing I haven't explained yet: what is the difference between the internal and external index? The internal index reflects saved content (in general) whilst the external index represents published content (in general). When we build our custom search in a moment, we'll run our searches against the external index. Check out the image below of the Examine Management dashboard:

Enough already! Can we get to the search part?

Ok, that was quite a bit of lead-up to building our search page. The search page for this site uses its own document type with a template, which we'll start with:

@using KGLLC.Umbraco.Helpers
@using PagedList
@using PagedList.Mvc
@inherits Umbraco.Web.Mvc.UmbracoTemplatePage
@{
    Layout = "Base.cshtml";

    //our keywords are simple query string params
    var keywords = HttpContext.Current.Request.QueryString["q"];

    //if no keywords, send the user back to the homepage
    if (string.IsNullOrWhiteSpace(keywords))
    {
        Response.Redirect("/");
        return;
    }

    //this is custom searcher code we'll investigate in a minute
    var searcher = new SearchHelper();

    //our custom searcher returns our results and we can order it by the 'score'
    var results = searcher.Search(keywords).OrderByDescending(x => x.Score);

    //use 'p' query string as the page number for pagination
    var pageQueryString = HttpContext.Current.Request.QueryString["p"];

    //default to the first page; ignore anything that isn't a positive whole number
    int pageNumber;

    if (!int.TryParse(pageQueryString, out pageNumber) || pageNumber < 1)
    {
        pageNumber = 1;
    }
   
    //using an external library to handle the pagination, set your items per page here
    var pagedResultList = results.ToPagedList(pageNumber, 10);
}

<div id="main" class="search-page">
    <section class="container">
        <h2>Showing @pagedResultList.TotalItemCount result(s) for keywords "@keywords":</h2>

        <ol>
            @foreach (var result in pagedResultList)
            {
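                //the search result only gives us limited information, so let's grab the IPublishedContent for the node Id and then output what we want
                var umbracoContent = Umbraco.TypedContent(result.Id);

                //skip any result whose node can't be resolved from the content cache
                if (umbracoContent == null)
                {
                    continue;
                }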
                
                <li>
                    <h3>
                        <a href="@umbracoContent.Url">@umbracoContent.Name</a>
                    </h3>
                    <small>@umbracoContent.Url</small>
                </li>
            }
        </ol>

    </section>

    <!-- the page buttons at the bottom get generated here -->
    <section class="container">
        @Html.PagedListPager(pagedResultList, page => ("?q=" + HttpUtility.UrlEncode(keywords) + "&p=" + page))
    </section>
</div>

Ok, let's point out a few things. We have a custom searcher called 'SearchHelper' that we haven't explained yet; this is where we'll write our custom Lucene queries against the external index, and we'll see the code for it next. We are also using a package called PagedList.Mvc, which I highly recommend for handling any sort of pagination. PagedList.Mvc also generates the page buttons at the bottom of the page. Ok, let's see how we write an Examine/Lucene query:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Examine;
using Umbraco.Core.Models;
using Umbraco.Web;

namespace KGLLC.Umbraco.Helpers
{
    public class SearchHelper
    {
        public ISearchResults Search(string keywords)
        {
            //grab the external searcher
            var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
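            //create the search criteria; BooleanOperation.Or is the default operator for combining clauses (we pass a raw query below, so it matters little here)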
            var searchCriteria = searcher.CreateSearchCriteria(Examine.SearchCriteria.BooleanOperation.Or);
            var rawQueries = new List<string>();

            if (!string.IsNullOrWhiteSpace(keywords))
            {
                //split multiple words by a 'space' character
                var words = keywords.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Select(x => x.ToLower()).ToList();

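                //exclude some common words, but keep the original list if nothing else would be left
                //(an empty list would produce the unparseable query "()" further down)
                var meaningfulWords = words.Except(new List<string>()
                {
                    "and",
                    "a",
                    "or",
                    "the"
                }).ToList();

                if (meaningfulWords.Any())
                {
                    words = meaningfulWords;
                }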

                //this is totally up to you how your search works, here I'm searching each word individually
                foreach (var word in words)
                {
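                    //this is raw Lucene syntax
                    //in plain English, this search does the following:
                    //+__IndexType:content <== requires the result to be content and not media
                    //-template:0 <== exclude results that have no template
                    //nodeName: {0}~{1} <== the name of the node should match the keyword with a fuzziness of 0.5
                    //mungedField: {0}~{1} <== the index field 'mungedField' should match the keyword with a fuzziness of 0.5
                    //note: the tilde only means 'fuzzy' after a bare term; after a quoted phrase Lucene reads it as proximity

                    //escape Lucene's special characters so user input can't break the query
                    var escapedWord = Lucene.Net.QueryParsers.QueryParser.Escape(word);

                    //format with the invariant culture so 0.5 is never written as "0,5"
                    var contentRawQuery = string.Format(System.Globalization.CultureInfo.InvariantCulture, @"(+__IndexType:content -template:0 && (nodeName: {0}~{1} mungedField: {0}~{1}))", escapedWord, 0.5);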
                    rawQueries.Add(contentRawQuery);
                }
            }

            var query = searchCriteria.RawQuery("(" + String.Join(")(", rawQueries) + ")");

            //send the query string to Lucene and return the results
            return searcher.Search(query);
        }
    }
}

The purpose of the searcher we wrote is to generate the Lucene query and send it off for processing; the result is returned to our view. As you can see in the code above, our searcher uses raw Lucene syntax, which is a simple string. However, if you'd like, you can use Examine's fluent API to build the query through method calls instead. I personally prefer direct manipulation of the query. Lucene allows for exact matching, fuzzy matching (words misspelled or similar) and boosting terms. Check out this page for comprehensive syntax documentation.
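Because the raw query is just a string, the string-building part can be tried out on its own, without Umbraco running. Here is a minimal sketch of that idea as a console program. The `RawQuerySketch` class, its `Escape` helper and the `+( ... )` grouping are assumptions made for this example rather than part of Examine; the field names are the ones used in this article. It uses the unquoted `term~0.5` form, which is the fuzzy syntax described in the Lucene query documentation.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class RawQuerySketch
{
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "and", "a", "or", "the" };

    // Characters that carry special meaning in Lucene's query syntax.
    private const string SpecialCharacters = "\\+-!(){}[]^\"~*?:|&";

    // Hand-rolled stand-in for a proper escaping routine:
    // prefix every special character with a backslash.
    public static string Escape(string term)
    {
        var sb = new StringBuilder();

        foreach (var c in term)
        {
            if (SpecialCharacters.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }

            sb.Append(c);
        }

        return sb.ToString();
    }

    // Returns the raw query string, or null when no searchable word is left.
    public static string Build(string keywords, double fuzziness)
    {
        var words = (keywords ?? string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(x => x.ToLowerInvariant())
            .Where(x => !StopWords.Contains(x))
            .Distinct()
            .ToList();

        if (words.Count == 0)
        {
            return null;
        }

        // One clause per word; the invariant culture keeps 0.5 from becoming "0,5".
        var clauses = words.Select(word => string.Format(
            CultureInfo.InvariantCulture,
            "+__IndexType:content -template:0 +(nodeName:{0}~{1} mungedField:{0}~{1})",
            Escape(word),
            fuzziness));

        return "(" + string.Join(")(", clauses) + ")";
    }

    public static void Main()
    {
        Console.WriteLine(Build("The Examine serch", 0.5));
    }
}
```

The printed string can be pasted straight into the search box on the Examine Management dashboard, which is a quick way to check a query before wiring it into the site.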

To tie some things together, let's go back to the Developer > Examine Management dashboard, and do a quick search on the external searcher. We can see our query matches up nicely with the external searcher tester:

By default, Umbraco makes every property, along with its value, available to the external searcher. This makes it easy for us to code our custom search against any of the fields you see listed. There is a common issue, though, that we will need to overcome. Properties on our documents that store values as JSON make for ugly fields to search against. Ideally we'd love just the values of the JSON properties and not all the curly braces and quotes. If you squint your eyes in the image above, you'll see the 'modules' property is storing JSON from the Archetype we used. As a result we're not searching that field in our custom searcher; we're searching the 'mungedField' property instead. If you were to go to the BlogPostPage document type, you wouldn't find a property with that name. That's because I added that field dynamically through code via an Examine event:

using System;
using System.Text;
using Archetype.Models;
using Examine;
using Newtonsoft.Json;
using Umbraco.Core;
using Umbraco.Core.Logging;

namespace KGLLC.Umbraco.Helpers
{
    public class ExamineMunger
    {
        //register our Examine event
        public class ExamineMungerRegistration : ApplicationEventHandler
        {
            protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
            {
                base.ApplicationStarted(umbracoApplication, applicationContext);

                ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"].GatheringNodeData += _mungeNodeData;
            }

            //there is a lot of code in here. This code unwraps the JSON and shoves it all into one field for easier searching
            //this code is also co-mingling the 'teaseDescription' with all of the module data as well
            private void _mungeNodeData(object sender, IndexingNodeDataEventArgs nodeData)
            {
                var sb = new StringBuilder();

                if (nodeData.Fields.ContainsKey("teaseDescription"))
                {
                    sb.Append(nodeData.Fields["teaseDescription"]);
                }

                if (string.Equals(nodeData.Fields["nodeTypeAlias"], "blogpostpage", StringComparison.OrdinalIgnoreCase))
                {
                    _handleBlogModules(sb, nodeData);
                }

                _updateMungedField(nodeData, sb.ToString());
            }

            private static void _handleBlogModules(StringBuilder sb, IndexingNodeDataEventArgs nodeData)
            {
                try
                {
                    if (nodeData.Fields.ContainsKey("modules"))
                    {
                        var archetypeValueAsString = nodeData.Fields["modules"];

                        var modules = JsonConvert.DeserializeObject<ArchetypeModel>(archetypeValueAsString);

                        foreach (var module in modules)
                        {
                            //both module types keep their searchable content in a 'text' property
                            if (module.Alias == "richtextModule" || module.Alias == "breakModule")
                            {
                                var value = module.GetValue<string>("text");

                                if (value != null)
                                {
                                    sb.Append(" " + value.StripHtml());
                                }
                            }
                        }
                    }
                }
                catch (Exception ex)
                {
                    LogHelper.Error<ExamineMunger>(ex.Message, ex);
                }
            }

            private static void _updateMungedField(IndexingNodeDataEventArgs nodeData, string value)
            {
                if (string.IsNullOrEmpty(value))
                {
                    return;
                }

                if (nodeData.Fields.ContainsKey("mungedField"))
                {
                    nodeData.Fields["mungedField"] += " " + value;

                    LogHelper.Info<ExamineMunger>(nodeData.Fields["nodeName"] + " - Updating..." + value);
                }
                else
                {
                    nodeData.Fields.Add("mungedField", value);

                    LogHelper.Info<ExamineMunger>(nodeData.Fields["nodeName"] + " - Creating..." + value);
                }
            }
        }
    }
}

Ok, I apologize. We seem to have covered a few topics here that might be a little more advanced than the intended audience needs. I do feel it's worth pointing out that you can do a lot of this sort of thing to clean up your data for nicer, more relevant results.
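If the Archetype-specific code feels like too much, the same clean-up idea can be sketched more generically with Json.NET's LINQ to JSON classes. The example below is only an illustration under a few assumptions: the `JsonTextSketch` class is hypothetical, the sample JSON is a simplified guess at what a stored value might look like, and the searchable text is assumed to live in properties named 'value'. It also strips tags with a simple regular expression rather than Umbraco's StripHtml().

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq;

public static class JsonTextSketch
{
    // Collect the string content of every property called "value", strip HTML tags,
    // collapse whitespace and join everything into one searchable string.
    public static string ExtractText(string json)
    {
        var root = JToken.Parse(json) as JContainer;

        // A bare JSON value (string, number, ...) has no properties to walk.
        if (root == null)
        {
            return string.Empty;
        }

        var parts = root
            .Descendants()
            .OfType<JProperty>()
            .Where(p => p.Name == "value" && p.Value.Type == JTokenType.String)
            .Select(p => Regex.Replace((string)p.Value, "<[^>]+>", " "))
            .Select(s => Regex.Replace(s, @"\s+", " ").Trim())
            .Where(s => s.Length > 0);

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Simplified sample, assumed for illustration only.
        var json = "{\"fieldsets\":[{\"alias\":\"richtextModule\",\"properties\":"
                 + "[{\"alias\":\"text\",\"value\":\"<p>Hello <b>world</b></p>\"}]}]}";

        Console.WriteLine(ExtractText(json));
    }
}
```

The trade-off is that a generic walk like this indexes every 'value' it finds, whereas the typed approach lets you pick exactly which modules and properties end up in the munged field.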

The search form

Ok, to this point we've explained a bit about Lucene/Examine, talked about our view, built a custom searcher and then tied into Examine's event system to fiddle with the index fields. The last part is the simple form that I dropped into the header. When submitting, the user is simply sent to the `/search` page and the keywords are passed via the query string:

@{
    var keywordsQuery = HttpContext.Current.Request["q"];
}
<section id="header" class="container-fluid">
    <div class="col-md-6 masthead">
        <h1><a href="/">Kevin Giszewski</a></h1>
        <p>Web Developer, Blogger, Freelancer, Filmmaker, Father</p>
    </div>
    <div class="col-md-6 not-masthead">
        <form class="navbar-form pull-right" role="search" action="/search" method="GET">
            <div class="input-group">
                <input type="text" class="form-control" placeholder="Search" name="q" id="keywords" value="@keywordsQuery">
                <div class="input-group-btn">
                    <button class="btn btn-default" type="submit"><i class="glyphicon glyphicon-search"></i></button>
                    <input type="submit" style="display:none" />
                </div>
            </div>
        </form>
    </div>
</section>

Summary

Whew! That was a lot to take in if you've never used Examine before. I highly recommend you learn how to use it versus using Google Site Search, a Google Search Appliance or any other solution. Keep in mind that you can even add your own indexes to Umbraco using any data source, like a database or even Azure Blobs. The only thing I've noticed Examine struggle with is PDF indexing. There is a package to handle PDFs, but I've gotten mixed results due to the many PDF formats in the wild. Next up in our series will be a simple RSS feed.