IndexTank got acquired by LinkedIn just a while back and I received a newsletter saying that they may discontinue services in 6 months. My tweet about this got a few responses from Pat Allan (@pat – the creator of Flying Sphinx) and others, which ended up with me promising a brief comparison between Thinking Sphinx (TS), WebSolr (Solr) and IndexTank and why I love IndexTank! There are plenty of documents and posts on how to set up, configure and use TS and Solr. This post gives information about using IndexTank and the various features available with it that I find awesome and/or non-existent in TS or Solr.
My experiments with Full-text searching
When I first set out needing full text searching, I used Solr. It was pretty good though re-indexing took ages and to ensure consistency, I had to re-index every day via cron. Then I found Thinking Sphinx – and loved it because it managed delta indexes! Wow – no more daily re-index cron jobs. Even the re-indexing was way quicker.
TS was easy to configure and it generated a production.sphinx.yml, which I could tweak (at my own risk) to get it to index what I want – a bad practice but works well if you know what you are doing.
The big issue with both Solr and TS was that they required tight integration with my models and my database. For example – in TS, if a relationship was changed, I had to make sure to trigger the parent / child delta index so that it got indexed too. Both TS and Solr add methods to ActiveRecord, which I find a little annoying.
These nuances make my code too dependent on TS or Solr, and switching from them to something else becomes a big pain!
IndexTank makes an entrance.
IndexTank is a NoSQL hosted service for full text search. What I liked best about IndexTank is that I had my application integrated in about 15 minutes – no frills, 3rd party storage. Wow!
The first thought is ‘performance hit’ … but when I checked from the Rails console, the result was totally acceptable! Of course performance is relative, but when you consider that the server load is reduced and a third party is doing a full text index on my data, the cost is covered.
Accessibility is awesome. I can create and update documents on IndexTank at will! That is, I don’t need to index / re-index or add deltas to my database, because IndexTank is not dependent on my database or even my schema!
I can now index what I want and how I want. Here is a small example:
# config/initializers/indextank.rb
CLIENT = IndexTank::Client.new('http://:67xN9mHBV7BV8w@iej.api.indextank.com')
INDEX = CLIENT.indexes('idx')
I like the fact that there are no rake tasks for configuration, starting and stopping. Since it’s a 3rd party service and not a separate process on my server, my server resources are not utilized. The icing on the cake is that we can manage multiple indexes on IndexTank and hence segregate the entire data, not just subset it!
Using it is pretty simple:
# app/models/user.rb
INDEX.document("User:id:#{self.id}").add(:text => "#{self.name} #{self.address}") # basically whatever I want
It’s interesting to note that self.name could be an attribute accessor but self.address could be a method which returns a GeoCoder formatted address – the point being that I can send processed output to IndexTank! Now, when I want search results, I simply do this:
results = INDEX.search("something")
=> {"matches"=>2, "query"=>"something", "facets"=>{}, "search_time"=>"0.009", "results"=>[{"docid"=>"User:id:33", "query_relevance_score"=>-2217269.0}, {"docid"=>"User:id:38", "query_relevance_score"=>-2739353.0}]}
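As an aside, the docids in that response still have to be mapped back to records. Here is a minimal sketch in plain Ruby of one way to do it, assuming the "User:id:33" docid format used when indexing. The helper names parse_docid and ids_by_class are made up for illustration and are not part of the IndexTank client.

```ruby
# Hypothetical helpers (not part of the IndexTank client): split a docid
# such as "User:id:33" back into a class name and a numeric id.
DOCID_PATTERN = /\A(?<klass>[A-Z]\w*):id:(?<id>\d+)\z/

def parse_docid(docid)
  match = DOCID_PATTERN.match(docid)
  raise ArgumentError, "unexpected docid: #{docid.inspect}" unless match
  [match[:klass], match[:id].to_i]
end

# Group the ids found in a search response by class name, keeping the
# order in which the results were returned.
def ids_by_class(response)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  response.fetch("results", []).each do |hit|
    klass, id = parse_docid(hit["docid"])
    grouped[klass] << id
  end
  grouped
end

# Sample response copied from the search output above (trimmed).
sample = {
  "matches" => 2,
  "results" => [
    { "docid" => "User:id:33", "query_relevance_score" => -2217269.0 },
    { "docid" => "User:id:38", "query_relevance_score" => -2739353.0 }
  ]
}

p ids_by_class(sample)
```

In a Rails app the grouped ids could then be handed to the model's finder, but that part depends on your own models and is left out here.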
So, I now get the number of matches and the objects which matched, sorted according to their relevance! ‘Relevance’ – what is that, you ask? It’s an amazing feature of IndexTank called scoring functions. The default scoring function is by time of creation but we can easily register new scoring functions so that we get the objects in the order we want. Wowie!
Note that ‘text’ is the default keyword for IndexTank for storing data. But there is a LOT more to this. I can not only provide my own keys instead of text, I can also use facets like ‘categories’ to limit my search. For example, I can modify my earlier insertion code like this:
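To make the idea of a scoring function concrete, here is a plain-Ruby illustration. It does not call IndexTank at all (IndexTank evaluates scoring functions on its own servers); it only shows how swapping the function changes the order. The documents, the timestamps and the "votes" variable are all made up for this sketch.

```ruby
# Plain-Ruby illustration only: nothing here talks to the IndexTank API.
# Each hypothetical document has a creation timestamp (in seconds) and a
# made-up "votes" value.
DOCS = [
  { docid: "User:id:33", timestamp: 1_000, votes: 2 },
  { docid: "User:id:38", timestamp: 400,   votes: 9 },
  { docid: "User:id:41", timestamp: 700,   votes: 5 }
].freeze

NOW = 2_000

# Ordering by time of creation: score = -age, so newer documents score higher.
BY_AGE = ->(doc) { -(NOW - doc[:timestamp]) }

# A custom ordering: documents with more votes score higher.
BY_VOTES = ->(doc) { doc[:votes] }

# Return docids ordered from the highest score to the lowest.
def rank(docs, scoring)
  docs.sort_by { |doc| -scoring.call(doc) }.map { |doc| doc[:docid] }
end

p rank(DOCS, BY_AGE)
p rank(DOCS, BY_VOTES)
```

The same set of matches comes back in a different order depending on which function is used, which is the whole point of being able to register your own.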
INDEX.document("User:id:#{self.id}").add(:text => "#{self.name} #{self.address}", :categories => {:type => 'admin'})
I added the ‘admin’ category for users. And when I search using the same query as above, I get this result:
results = INDEX.search("something")
=> {"matches"=>2, "query"=>"something", "facets"=>{'type' => {'admin' => 1, 'user' => 1}}, "search_time"=>"0.009", "results"=>[{"docid"=>"User:id:33", "query_relevance_score"=>-2217269.0}, {"docid"=>"User:id:38", "query_relevance_score"=>-2739353.0}]}
Notice this part of the results in particular, which counts the matches per category and lets me filter my search results:
"facets"=>{'type' => {'admin' => 1, 'user' => 1}}
A slight modification in my search call returns a subset of the results:
results = INDEX.search("something", :category_filters => { :type => 'admin'})
=> {"matches"=>1, "query"=>"something", "facets"=>{'type' => {'admin' => 1}}, "search_time"=>"0.007", "results"=>[{"docid"=>"User:id:33", "query_relevance_score"=>-2217269.0}]}
Not that this is not doable in TS or Solr – but it’s not as usable as IndexTank. Please educate me otherwise. This post is getting long already, so I plan to detail out IndexTank faceting and scoring functions in my next post and keep this one as a comparison between IndexTank and TS or Solr.
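For anyone new to faceting, the relationship between the facet counts and the category filter can be sketched with a small in-memory example. This is plain Ruby and does not use the IndexTank client. The documents, the substring matching and the method names matching and facets are assumptions made purely for illustration.

```ruby
# In-memory illustration of facet counts and category filters.
# The documents and their categories are made up for this example.
DOCUMENTS = [
  { docid: "User:id:33", text: "something here", categories: { "type" => "admin" } },
  { docid: "User:id:38", text: "something else", categories: { "type" => "user" } },
  { docid: "User:id:40", text: "unrelated",      categories: { "type" => "user" } }
].freeze

# Keep documents whose text contains the query and whose categories match
# every name/value pair in the filter (an empty filter matches everything).
def matching(documents, query, category_filters = {})
  documents.select do |doc|
    doc[:text].include?(query) &&
      category_filters.all? { |name, value| doc[:categories][name] == value }
  end
end

# Count, per category name, how many matched documents carry each value.
def facets(matched)
  matched.each_with_object({}) do |doc, counts|
    doc[:categories].each do |name, value|
      counts[name] ||= Hash.new(0)
      counts[name][value] += 1
    end
  end
end

all_hits   = matching(DOCUMENTS, "something")
admin_hits = matching(DOCUMENTS, "something", { "type" => "admin" })

p facets(all_hits)
p facets(admin_hits)
```

The unfiltered search reports one ‘admin’ and one ‘user’ match, and applying the filter narrows both the matches and the facet counts to the ‘admin’ document, which mirrors the two responses shown above.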
Heroku Setup
First and foremost, IndexTank’s basic version is free! At the time of writing this post, the lowest Flying Sphinx plan is the ‘wooden’ version ($12 per month) and the lowest WebSolr plan is the ‘Silver’ version ($20 per month). This is pretty steep when it comes to prototyping or building small apps. The IndexTank free version stores up to 100,000 documents, which is more than enough to support a small application.
Furthermore, for both TS and WebSolr, we need workers which are run by the daily cron. This increases the cost further.
The Flying Sphinx and WebSolr configuration on Heroku is no simple task either. There are quite a few caveats to cater to and it can still cause some hiccups. It’s worth mentioning that the support you get from Flying Sphinx and WebSolr is indeed awesome. But I would expect it if I were paying a fee for the setup 😉
Hope this stirs up a healthy debate – (and secretly hoping that IndexTank services continue even after 6 months).
Hi Gautam, thanks for that writeup. Pretty interesting.
Most of what you note is actually only superficially different from Solr, which makes some sense, considering IndexTank uses Lucene under the hood and thus comes from the same conceptual lineage as Solr.
Actually, the client syntax here reminds me a lot of RSolr (https://github.com/mwmitchell/rsolr). I even whipped up a quick side-by-side comparison at https://gist.github.com/1292850#file_rsolr.rb. Based on that, I think it should be almost trivial to implement a client that’s syntax compatible with what you’re doing above, but powered by Solr under the hood.
Anyway, thanks for the feedback! I’m always interested to learn more about what works for different people as well as borrow good ideas wherever I can find them.
Nick,
I am not sure if IndexTank borrows from Lucene – they have not open-sourced their stack (yet). I believe they are going to shortly though. Does Solr support scoring functions? I don’t think so (or please educate me).
It’s also not just the IndexTank client – for me an FTS engine’s ‘coolness factor’ depends on
– ease of client usability (and server setup if applicable)
– speed in terms of time
– performance in terms of scale and concurrency
– customization
Considering all these factors – IndexTank rocks!
Hi Gautam
Great to have this reflection on the strengths of IndexTank. Just a few thoughts of my own:
* The settings in Sphinx config files generated by Thinking Sphinx can be modified reliably through config/sphinx.yml – which is far easier (and part of TS’s workflow). I realise it might be some time ago, but I don’t suppose you remember what you were changing in the generated file?
* Sphinx definitely has relevance weights and scoring functions – and I’m sure it exists in Solr as well.
* I agree that auto-updating of index data is the best approach – Sphinx has (relatively recently) got that functionality, although it’s not as fully-featured as their SQL-based indexes and sources. I plan for Thinking Sphinx 3 to take advantage of that – it remains to be seen how easy it is in comparison to IndexTank, though.
* I’d love to have a free plan for Flying Sphinx, but RAM is not cheap, and that’s really the limiting factor when it comes to hosting many Sphinx daemons on servers. Daily cron jobs are free, though.
* That said, I’ve tried hard to make sure Flying Sphinx setup is as easy as possible – and there’s not much in the way of caveats. Are there any particularly painful/confusing/odd/annoying parts of the setup? And that goes for Sphinx/Thinking Sphinx as well – I’m keen to ensure things are as streamlined as possible.
* I’m also loath to add too many methods to ActiveRecord::Base – Thinking Sphinx 3 will be much better in that regard (I’m still working on it, the code isn’t available for use yet).
Again, thanks for the feedback, we don’t see enough of these posts comparing different libraries for the same purpose 🙂
Pat
Awesome comment.
I was not aware that all these features were already available in TS – partly due to necessity and probably because of a ‘just get it done mate’ attitude. I wonder if it may also be because of a lack of documentation pointers to these critical features! What do you say?
Also confirmed with Nick that relevance and scoring functions are there in Solr. This is great! I am pretty sure not too many people use them – but would greatly benefit from them.
Once IndexTank becomes open-source (they do mention something to that tune), it will be easier to compare these 3 top FTS engines.
On the note of changing the generated sphinx file, I do remember it was something to do with custom queries between multiple SQL database lookups – I’ll dig it up and let you know the details.
Nick, Pat,
What are your thoughts on ‘complete database independence’ for FTS engines?
Perhaps the relevance weights could be documented better – but it is Sphinx’s default sorting value, so I guess I’ve just assumed people will read about it when they look to change sorting behaviour.
As for complete database independence – I don’t find my databases change dramatically often, and so I don’t feel it’s an important issue. Also, with regards to Sphinx – it’s so SQL-focused (which provides it with such fast indexing speeds) that you can’t really escape being dependent on the database.
That said, the real-time index updates rely on different index definitions (realtime as opposed to SQL), and so Sphinx would be far less dependent on the database in that situation.
But really, at the end of the day you’re dependent on something – whether that’s the database or the model’s methods or some other data source.
Hi, I think you should have a look at ElasticSearch: http://www.elasticsearch.org/, a full-text search engine and NoSQL database based on Apache Lucene. It could appeal to you in terms of interface and accessibility, which you seem to have appreciated while using IndexTank. (There are a couple of Rubygems providing integration, see http://www.elasticsearch.org/guide/appendix/clients.html.)
Regards,
Karel
Looks very promising Karel – Thanks.
I shall start evaluating ElasticSearch. It should make an interesting Heroku addon!
I would love to hear your thoughts on ElasticSearch. I’m trying to choose between ES and IndexTank.
@khanhtran – We recently learnt about Searchify! It’s the IndexTank clone. We are still working it out with ES but I would strongly recommend using searchify.com