MongoDB text indexing in action

Recently, I had a good use case for MongoDB text indexing, so I gave it a shot. I found it to be pretty awesome: even though it does not have the capabilities of a full-fledged text-search engine (facets, for example), it does the job for simple text searches.

So, what did we want to do? We had the following model:

class GeneralEntity
  include Mongoid::Document

  field :org_name, type: String

  # some other fields

  embeds_one :contact
end

class Contact
  include Mongoid::Document

  field :last_name, type: String
  field :first_name, type: String
  field :middle_name, type: String
  field :email, type: String
  field :city, type: String
  field :state, type: String
  field :zip_code, type: String

  embedded_in :general_entity
end

Now, we had to search for people by name, like “John Doe” or “Jane”, or by email, like “david@example.com”. I decided to give text indexing a shot.

Some of the salient features of text indexing are tokenizing, stemming and relevance scores. Basically, the indexed text is split into words on delimiters (whitespace by default), the words are reduced to their stems using language-specific rules (multiple languages are supported), and each result carries a weighted score that tells us how relevant that result is.
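
To make those three ideas concrete, here is a toy sketch in plain Ruby. It is not MongoDB's algorithm: the suffix-stripping `toy_stem`, the `toy_tokens` splitter and the `toy_score` formula are all made up for illustration, and the real tokenizer, stemmer and scoring are more sophisticated.

```ruby
# Toy illustration only; MongoDB's real tokenizer, stemmer and scoring differ.

# Hypothetical "stemmer": strips a trailing "ing" or "s".
def toy_stem(word)
  word.sub(/(ing|s)\z/, '')
end

# Lowercase, split on non-word characters, stem, and drop empty tokens.
def toy_tokens(text)
  text.downcase.split(/\W+/).map { |w| toy_stem(w) }.reject(&:empty?)
end

# For each field, count the query terms it contains and multiply by the
# field's weight; the document score is the sum over all fields.
def toy_score(doc, query, weights)
  terms = toy_tokens(query).uniq
  weights.sum do |field, weight|
    field_tokens = toy_tokens(doc.fetch(field, ''))
    terms.count { |t| field_tokens.include?(t) } * weight
  end
end

weights = { 'first_name' => 10, 'last_name' => 10, 'org_name' => 5 }
doc = { 'first_name' => 'John', 'last_name' => 'Doe', 'org_name' => 'Doe Holdings' }

puts toy_tokens('Searching dogs').inspect
puts toy_score(doc, 'john doe', weights)
```

A match in a heavily weighted field contributes more to the score than a match in a lightly weighted one, which is the same idea behind the weights used later in this post.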

Getting Started

To get started, you need to enable text-indexing. You can do this in 2 ways:

Add the following line to your mongod.conf file if you start MongoDB with --config.

setParameter = textSearchEnabled=true

Alternatively, you can issue the command from the command line via the mongo shell:

mongo admin --eval "db.runCommand({ setParameter: 1, textSearchEnabled: true})"

Then we create the text index:

class GeneralEntity
  include Mongoid::Document

  field :org_name, type: String
  # other fields

  index({
          "contact.first_name" => 'text',
          "contact.last_name" => 'text',
          "contact.middle_name" => 'text',
          "org_name" => 'text',
          "contact.email" => 'text'
        },
        {
          weights: {
            'contact.first_name' => 10,
            'contact.last_name' => 10,
            'contact.middle_name' => 5,
            'org_name' => 5
          },
          name: 'ge_text_index'
        })

  # other stuff
end

A little explanation here:

  • We can assign weights to our text-indexed fields. The default weight is 1, so in our case above, first name and last name carry the most weight, while email keeps the default of 1.
  • Remember that the default index name is a concatenation of all the indexed field names. Index names are length-limited (in MongoDB releases of this era, the index name together with its namespace could not exceed 128 characters), so it's better to give the index an explicit name.
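
To see why an explicit name helps, here is a small sketch in plain Ruby (no database needed) that builds the name MongoDB would generate by default for a text index: each field name followed by `_text`, joined with underscores. The helper `default_index_name` is hypothetical and exists only for this illustration.

```ruby
# Hypothetical helper: mimics how the default index name is derived from
# the index specification (field name + "_" + index type, joined by "_").
def default_index_name(spec)
  spec.map { |field, type| "#{field}_#{type}" }.join('_')
end

spec = {
  'contact.first_name'  => 'text',
  'contact.last_name'   => 'text',
  'contact.middle_name' => 'text',
  'org_name'            => 'text',
  'contact.email'       => 'text'
}

name = default_index_name(spec)
puts name
puts "default name: #{name.length} characters"
puts "explicit name: #{'ge_text_index'.length} characters"
```

The database and collection names count towards the limit as well, so every additional indexed field brings the generated name closer to it.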

Caveat: Remember that indexes perform best when they fit in memory, and text indexes can be really huge! So, just how much space does ours consume? Here are the statistics:

> db.general_entities.stats()
{
   "ns" : "ent_dev.general_entities",
   "count" : 2494875,
   "size" : 1497012268,
   "storageSize" : 1580052480,
   ...
   "indexSizes" : {
     "_id_" : 80958752,
     "ge_text_index" : 222902288,
   },
  "ok" : 1
}

So, with approximately 2.5 million documents and a total data size of about 1.5GB, the text index came to about 223MB. By MongoDB index standards that is just about acceptable, but by text-search index standards it is abysmal. However, it all depends on how much memory you have. (We had an EC2 m1.large instance with roughly 8GB of memory, so this was fine!)
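
The arithmetic behind those figures, as a quick sketch in plain Ruby. The numbers are copied from the stats output above; the variable names are mine.

```ruby
# Figures copied from the db.general_entities.stats() output above.
stats = {
  'count' => 2_494_875,
  'size' => 1_497_012_268,
  'indexSizes' => { '_id_' => 80_958_752, 'ge_text_index' => 222_902_288 }
}

text_index_bytes  = stats['indexSizes']['ge_text_index']
total_index_bytes = stats['indexSizes'].values.sum

# Average index bytes per document, and index size relative to the data.
bytes_per_doc = text_index_bytes.to_f / stats['count']
share_of_data = 100.0 * text_index_bytes / stats['size']

puts format('text index: %.1f MB', text_index_bytes / 1e6)
puts format('all indexes: %.1f MB', total_index_bytes / 1e6)
puts format('text index per document: %.1f bytes', bytes_per_doc)
puts format('text index vs data size: %.1f%%', share_of_data)
```

Running the same calculation against your own collection stats gives a rough idea of whether the indexes will fit in the memory you have.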

Faceted Search via text index

When several document fields are covered by the text index, a text search spans all of them. For example, if I search for ‘John’, the text search returns (weighted accordingly) every document in which John occurs in any of the indexed fields, i.e. first_name, last_name etc.

Since I wanted a faceted search, i.e. I wanted to search for first_name ‘John’ specifically, I needed to tweak the code a little to support this.

To add to the woes, there is no direct support in Mongoid as yet for searching on text indexes. So, I created a quick module to do this. I needed the text-index search to return a criteria, so that it could be chained with other conditions.

So, this is a 2-step process. First, we run the text search command and get a result set; then we return a criteria for that result set.

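module Moped
  module Search

    # Runs the MongoDB `text` command and returns a chainable criteria.
    #
    # @param str [String] query string in MongoDB text-search syntax
    # @param options [Hash] extra options merged into the command
    def search(str, options={})
      # Default limit: 50 (MongoDB default: 100). Merge into a new hash
      # so that the caller's options are not mutated.
      options = options.merge(limit: options[:limit] || 50)

      res = self.mongo_session.command({ text: self.collection.name,
                    search: str }.merge(options))

      # We shall now return a criteria of resulting objects!!
      # Note: the criteria does not preserve the relevance order.
      self.where(:id.in => res['results'].collect { |o| o['obj']['_id'] })
    end
  end
end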
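
To show what the second step works with, here is a sketch in plain Ruby that pulls the ids out of a response. The `response` hash is hand-written sample data in the shape the `text` command returns (a `results` array whose entries carry a `score` and the matched document under `obj`); the ids, scores and the helper `matching_ids` are made up for illustration.

```ruby
# Hand-written sample in the shape of a `text` command response.
# The ids and scores below are hypothetical.
response = {
  'results' => [
    { 'score' => 11.0, 'obj' => { '_id' => 'id-1', 'org_name' => 'Acme' } },
    { 'score' => 5.5,  'obj' => { '_id' => 'id-2', 'org_name' => 'Doe Holdings' } }
  ],
  'ok' => 1.0
}

# Collect the _id of every matched document; an absent 'results' key
# yields an empty list instead of raising.
def matching_ids(response)
  Array(response['results']).map { |result| result['obj']['_id'] }
end

puts matching_ids(response).inspect
```

These ids are what the module feeds into `where(:id.in => ...)` to build the criteria.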

This is how I could use it anywhere, in my model or controller, once the model extends the module (extend Moped::Search).


# app/controllers/search_controller.rb#search

  state = params[:general_entity][:contact_attributes][:state]
  org_name = params[:general_entity][:org_name]
  first_name = params[:general_entity][:contact_attributes][:first_name]

  # The text search that returns a criteria!
  # (str is the free-text query string; how it is built is not shown here)
  @ges = GeneralEntity.search(str)

  # Refine the search result further with faceted search.
  # Regexp.escape keeps user input from being interpreted as a pattern.
  @ges = @ges.where(:"contact.state" => state)
  @ges = @ges.where(:org_name => /\b#{Regexp.escape(org_name)}/i) unless org_name.blank?
  @ges = @ges.where(:"contact.first_name" => /\b#{Regexp.escape(first_name)}/i) unless first_name.blank?

Limiting Searches

Now, if I am searching for “John” among 2.5 million documents, I am probably going to get a lot of results and, as we can see, I am limiting them to the first 50 (which is configurable, of course). If we want to narrow the search further, we can make the ‘state’ field mandatory and add it as a filter! This drastically improved my search results.

The gist is available here if you are interested in seeing how faceted search is done.

We can narrow the search further by searching for phrases! To do so, wrap the search string in escaped double quotes. For example, \"john doe\" matches only documents containing the phrase “john doe”, whereas the unquoted search john doe matches documents containing “john” OR “doe”.
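
A tiny sketch of building such a phrase query in plain Ruby. The helper `phrase_query` is hypothetical; dropping embedded double quotes is my own simplification to keep the query well-formed.

```ruby
# Wrap a phrase in double quotes so the text search treats it as one phrase.
# Embedded double quotes are dropped to keep the query well-formed.
def phrase_query(phrase)
  %("#{phrase.delete('"')}")
end

puts phrase_query('john doe')
```

The resulting string can then be passed as the search string, for example GeneralEntity.search(phrase_query('john doe')).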

Well, though the MongoDB documentation does say it's not ready for production, it has my vote. It’s not best in class for text search (I recommend ElasticSearch instead if you need to turn up the heat), but for simple stuff, it works!

About Gautam Rege

Rubyist, Entrepreneur and co-founder of Josh-Software - one of the leading Ruby development shops in India.