Recently, I had a good use-case for MongoDB text indexing and I gave it a shot. I found it to be pretty awesome: even though it does not have full-fledged text-search engine capabilities (like facets etc.), it does the job for simple text searches.
So, what did we want to do? We had the following model:
class GeneralEntity
  include Mongoid::Document

  field :org_name, type: String
  # some other fields

  embeds_one :contact
end

class Contact
  include Mongoid::Document

  field :last_name, type: String
  field :first_name, type: String
  field :middle_name, type: String
  field :email, type: String
  field :city, type: String
  field :state, type: String
  field :zip_code, type: String

  embedded_in :general_entity
end
Now, we had to search for people like “John Doe” or “Jane” or with email “david@example.com”. I decided to give text indexing a shot.
Some of the salient features of text indexing are tokenizing, stemming and relevance scores. Basically, the text is split into words on the default delimiter (whitespace), stemming is supported for multiple languages, and each result carries a relevance score that tells us how well it matched.
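To make those three ideas concrete, here is a toy sketch in plain Ruby. It is not MongoDB's algorithm: the whitespace tokenizer, the strip-a-trailing-"s" stemmer and the additive scoring are all simplifying assumptions of mine, and the `WEIGHTS` table and `toy_score` helper are hypothetical names used only for illustration.

```ruby
# Toy illustration only: MongoDB's real tokenizer, stemmer and scoring
# are more involved than this.
WEIGHTS = { first_name: 10, last_name: 10, org_name: 5 }.freeze

# Split on whitespace and lowercase every token.
def tokenize(text)
  text.to_s.downcase.split(/\s+/).reject(&:empty?)
end

# Naive "stemmer": drop a trailing "s" from longer words.
def stem(token)
  token.length > 3 ? token.sub(/s\z/, "") : token
end

# Add up the weight of every field that contains a query term.
def toy_score(doc, query)
  terms = tokenize(query).map { |t| stem(t) }
  WEIGHTS.sum do |field, weight|
    field_terms = tokenize(doc[field]).map { |t| stem(t) }
    weight * terms.count { |t| field_terms.include?(t) }
  end
end

doc = { first_name: "John", last_name: "Doe", org_name: "Doe Builders" }
puts toy_score(doc, "john builder")
```

A match on a heavily weighted field (first name) counts for more than a match on a lightly weighted one (organisation name), which is the same intuition behind the weights used on the real index.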
Getting Started
To get started, you need to enable text-indexing. You can do this in 2 ways:
Add the following text to your mongod.conf file if you are using --config to start MongoDB.
setParameter = textSearchEnabled=true
Alternatively, you can issue a command from the mongo CLI:
mongo admin --eval "db.runCommand({ setParameter: 1, textSearchEnabled: true})"
Then we create the text index:
class GeneralEntity
  include Mongoid::Document

  field :org_name, type: String
  # other fields

  index({
      "contact.first_name"  => 'text',
      "contact.last_name"   => 'text',
      "contact.middle_name" => 'text',
      "org_name"            => 'text',
      "contact.email"       => 'text'
    },
    {
      weights: {
        'contact.first_name'  => 10,
        'contact.last_name'   => 10,
        'contact.middle_name' => 5,
        'org_name'            => 5
      },
      name: 'ge_text_index'
    }
  )

  # other stuff
end
A little explanation here:
- We can assign weights to our text-indexed fields. By default the weight is 1 (as it is for contact.email here), so in our case above, first name and last name get the maximum weight.
- Remember that by default the index name is built by appending all the indexed field names. MongoDB caps the length of an index name (the documentation lists 128 characters for the fully qualified name, namespace included), so it's better to give the index an explicit name.
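To see why the explicit name matters, here is a small sketch of how long the generated name would get for the index above. It assumes the default name is formed by joining each field and its index type with underscores (for example "org_name_text"); `default_index_name` is a hypothetical helper of mine, not a Mongoid or MongoDB API.

```ruby
# Assumption: the default index name joins every "<field>_<type>" pair
# with underscores. This helper is illustrative, not a driver API.
def default_index_name(spec)
  spec.map { |field, type| "#{field}_#{type}" }.join("_")
end

spec = {
  "contact.first_name"  => "text",
  "contact.last_name"   => "text",
  "contact.middle_name" => "text",
  "org_name"            => "text",
  "contact.email"       => "text"
}

generated = default_index_name(spec)
puts generated
puts "generated name: #{generated.length} characters"
puts "explicit name:  #{'ge_text_index'.length} characters"
```

With five indexed fields the generated name is already over a hundred characters before the database and collection names are added, so a short explicit name like ge_text_index is the safer choice.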
Caveat: Remember that MongoDB tries to keep indexes in memory, and text indexes can be really huge! So, just how much space does one consume? Here are the statistics:
> db.general_entities.stats()
{
  "ns" : "ent_dev.general_entities",
  "count" : 2494875,
  "size" : 1497012268,
  "storageSize" : 1580052480,
  ...
  "indexSizes" : {
    "_id_" : 80958752,
    "ge_text_index" : 222902288,
  },
  "ok" : 1
}
So, with approximately 2.5 million documents and about 1.5GB of data, the text index came to roughly 222MB. By MongoDB index standards that is just about acceptable, but by text search index standards it is abysmal. However, it all depends on how much memory you have. (We had an EC2 m1.large instance with about 7.5GB of memory, so this was fine!)
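For a feel of what those numbers mean per document, here is the arithmetic as a short Ruby sketch. The figures are copied from the stats output above; the variable names are mine.

```ruby
# Figures copied from the db.general_entities.stats() output above.
stats = {
  "count"      => 2_494_875,
  "size"       => 1_497_012_268,
  "indexSizes" => { "_id_" => 80_958_752, "ge_text_index" => 222_902_288 }
}

text_index_bytes = stats["indexSizes"]["ge_text_index"]

# Average text-index cost per document, in bytes.
bytes_per_doc = text_index_bytes.to_f / stats["count"]

# Text index size as a percentage of the raw data size.
percent_of_data = 100.0 * text_index_bytes / stats["size"]

puts format("%.1f bytes of text index per document", bytes_per_doc)
puts format("%.1f%% of the data size", percent_of_data)
```

That works out to somewhere around 90 bytes of index per document, or roughly 15% on top of the data itself, for an index that only covers a handful of short name and email fields.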
Faceted Search via text index
When several document fields are covered by the text index, a text search spans all of them. For example, if I search for ‘John’, the text search (with its proper weights) will return results where John occurs in any of the indexed fields, i.e. first_name, last_name etc.
Since I wanted a faceted search, i.e. I wanted to search specifically for first_name ‘John’, I needed to tweak the code a little to support this.
Adding to the woes, there is no direct support in Mongoid as yet for searching on text indexes. So, I created a quick module to do this. I needed the text-index search to return a criteria, so that it could be chained with other conditions.
So, this is a 2-step process. First, we run the text search command and get a result, and then we return a criteria for that result set.
module Moped
  module Search
    # Runs a MongoDB text search and returns a chainable criteria.
    #
    # @params
    #   str: Query string based on mongodb text search.
    #   options: Extra options for the text command (e.g. :limit).
    def search(str, options = {})
      # default limit: 50 (mongoDB default: 100)
      # merge instead of assigning, so the caller's hash is not mutated
      options = { limit: 50 }.merge(options)

      res = self.mongo_session.command({ text: self.collection.name, search: str }.merge(options))

      # We shall now return a criteria of resulting objects!!
      self.where(:id.in => res['results'].collect { |o| o['obj']['_id'] })
    end
  end
end
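One thing to keep in mind with this approach: an `$in` query gives no guarantee that documents come back in the order of the ids you passed, so the relevance ranking from the text search can get lost once the criteria is evaluated. Below is a minimal sketch of restoring it in plain Ruby. The response hash is a made-up example shaped like the `results` / `score` / `obj` structure used above, and `ranked_ids` and `in_rank_order` are hypothetical helpers, not part of Mongoid or Moped.

```ruby
# Hypothetical response, shaped like the text command result used above.
response = {
  "results" => [
    { "score" => 1.5, "obj" => { "_id" => "b" } },
    { "score" => 2.0, "obj" => { "_id" => "a" } },
    { "score" => 0.7, "obj" => { "_id" => "c" } }
  ]
}

# Ids ordered from most to least relevant (sorted defensively by score).
def ranked_ids(response)
  response.fetch("results", [])
          .sort_by { |r| -r["score"] }
          .map { |r| r["obj"]["_id"] }
end

# Reorder already-fetched documents to follow the ranking.
def in_rank_order(docs, ids)
  position = ids.each_with_index.to_h
  docs.sort_by { |d| position.fetch(d["_id"]) }
end

ids = ranked_ids(response)

# Pretend the database returned the documents in some other order.
fetched = [{ "_id" => "c" }, { "_id" => "a" }, { "_id" => "b" }]
puts in_rank_order(fetched, ids).map { |d| d["_id"] }.inspect
```

This only makes sense for small result sets such as the 50-document limit used here, since the sorting happens in Ruby after the documents are loaded.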
This is how I could use it anywhere, in my model or controller.
# app/controllers/search_controller.rb#search
state      = params[:general_entity][:contact_attributes][:state]
org_name   = params[:general_entity][:org_name]
first_name = params[:general_entity][:contact_attributes][:first_name]

# The text search that returns a criteria!
# (str is the text search string; how it is built is not shown here.)
@ges = GeneralEntity.search(str)

# Refine the search result further with faceted search.
# Regexp.escape keeps user input from being treated as regex syntax.
@ges = @ges.where(:"contact.state" => state)
@ges = @ges.where(:org_name => /\b#{Regexp.escape(org_name)}/i) unless org_name.blank?
@ges = @ges.where(:"contact.first_name" => /\b#{Regexp.escape(first_name)}/i) unless first_name.blank?
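The facet filters above are word-prefix regexes built from user input, and raw input can contain characters that mean something to the regex engine. Here is a small sketch of a helper that builds the same kind of pattern safely; `word_prefix_regex` is a hypothetical name of mine, while `Regexp.escape` is standard Ruby.

```ruby
# Build a case-insensitive "word starts with term" pattern, treating
# the term as literal text rather than as regex syntax.
def word_prefix_regex(term)
  /\b#{Regexp.escape(term.to_s)}/i
end

puts word_prefix_regex("john").match?("Big John Doe") # matches at a word start
puts word_prefix_regex("ohn").match?("John")          # mid-word, no match
puts word_prefix_regex("a.c").match?("abc")           # "." is literal, no match
```

Without the escaping, an input like "(" would raise a RegexpError and an input like "a.c" would quietly match "abc".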
Limiting Searches
Now, if I am searching for “John” among 2.5 million documents, I am probably going to get a lot of results and, as we can see, I am limiting them to the first 50 (which is configurable, of course). If we want to narrow the search further, we can make the ‘state’ field mandatory and add it as a filter! This drastically improved my search results.
The gist is available here if you are interested in seeing how faceted search is done.
We can further limit the search by searching for phrases! We can wrap the search string in escaped double quotes if required. For example, "\"john doe\"" searches for documents containing the phrase “john doe” (that is, john AND doe together) and not “john OR doe”.
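A tiny sketch of building such a phrase query from user input follows. The `phrase` helper is a hypothetical name of mine; it simply strips any double quotes the user typed and wraps the rest in one pair of quotes, which is the form the text search treats as a phrase.

```ruby
# Wrap the input in double quotes so it is searched as one phrase.
# Any double quotes already in the input are removed first.
def phrase(str)
  %("#{str.to_s.delete('"').strip}")
end

puts phrase("john doe")
```

The result can then be passed to the search method in place of the plain string when an exact phrase is wanted.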
Well, even though the MongoDB documentation site does say that text search is not recommended for production yet, it has my vote. It’s not the best in the class for text searches (I recommend using ElasticSearch instead if you need to turn on the heat), but for simple stuff, it works!