Mongoid and the MongoDB Aggregation Framework

MongoDB introduced the aggregation framework in version 2.2, but Mongoid has only tapped its power since 3.1.0. Even in the latest version (currently v4.0.0), Mongoid uses the aggregation framework only for some basic functionality: :sum, :count, :avg, :min and :max. While this is much better than running a map/reduce to get this information, the potential of the aggregation framework remains largely untapped!

This post discusses how we can leverage the aggregation framework in Mongoid. But first, a brief look at how the aggregation framework in MongoDB works and why it’s so awesome!

The Aggregation Pipeline

The aggregation pipeline is conceptually very much like the Unix pipe: the output of one command is “piped” into the next command as its input. How does an aggregate query differ from others?

An aggregate always returns an array of documents, which may be manipulated in the pipeline if required. A Mongoid criteria, on the other hand, is evaluated lazily and only when required. If we actually fire queries to get data, there isn’t much difference between the individual queries.

[conn1] query sodibee_development.transactions query:
{ account_id: ObjectId('50f8f036da58d2462900079c') }
ntoreturn:0 ntoskip:0 nscanned:58 keyUpdates:0
locks(micros) r:588 nreturned:58 reslen:33172 87ms

[conn1] command sodibee_development.$cmd command:
{ aggregate: "transactions", pipeline: [
{ $match: { account_id: ObjectId('50f8f4cdda58d24629003396') } } ]
 } ntoreturn:1 keyUpdates:0 numYields: 15 locks(micros) r:5016
reslen:22068 72ms

So, when should we use the aggregation framework? As the Mongoid gem has shown us, it’s ideal for getting “counts”: sum, average, minimum and maximum values. However, the aggregation framework’s real power is in the grouping of documents.
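To make the pipe idea concrete, here is a toy, in-memory model of a pipeline in plain Ruby. It is only a sketch of the concept, not how MongoDB implements it; the sample documents and the stage names (match_credits, sort_by_amount, sum_amounts) are made up for illustration.

```ruby
# A toy, in-memory model of an aggregation pipeline (not MongoDB itself).
# Each stage is a lambda that takes an array of documents and returns a new array.
docs = [
  { 'trx_type' => 'credit', 'amount' => 100 },
  { 'trx_type' => 'debit',  'amount' => 40 },
  { 'trx_type' => 'credit', 'amount' => 25 }
]

# Hypothetical stages, loosely mirroring $match, $sort and a summing $group.
match_credits  = ->(input) { input.select { |d| d['trx_type'] == 'credit' } }
sort_by_amount = ->(input) { input.sort_by { |d| d['amount'] } }
sum_amounts    = ->(input) { [{ 'total' => input.reduce(0) { |sum, d| sum + d['amount'] } }] }

pipeline = [match_credits, sort_by_amount, sum_amounts]

# Feed the output of each stage into the next one, like `a | b | c` in a shell.
result = pipeline.reduce(docs) { |current, stage| stage.call(current) }

puts result.inspect
```

The point of the sketch is the reduce call: every stage only ever sees what the previous stage produced, and the final value is always an array of documents.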

Group By queries

The aggregation pipeline is an array of operations that process documents in a collection. At each step, as the pipeline processes the documents, the output is fed into the next operator until we get a final aggregated result. An excellent explanation is available in the MongoDB documentation. Let’s see a good use case for the aggregation framework.

We need to get documents in a collection grouped by their transaction type for a particular account. How do we do that?

We do have a group_by method available on Mongoid criteria via Ruby’s Enumerable. Let’s see the benchmarks!

2.0.0p247 :001> Benchmark.realtime { data =
group_by(&:trx_type) }

=> 4.554763174

The MongoDB log shows us more information:

[conn17407] query sodibee_development.transactions query:
{ $query: { account_id: ObjectId('51148e0a45564c8ebf001e13') },
$orderby: { date: -1 } } ntoreturn:10000 ntoskip:0
nscanned:478513 scanAndOrder:1 keyUpdates:0
numYields: 2 locks(micros) r:4051979 nreturned:10000
reslen:5553930 2486ms

As we can see, the time taken in the database was 2.486 seconds and the result is a whopping 5.5MB of data. Even though the database returned in about 2.5 seconds, it takes 2 seconds more to prepare the Ruby objects and return the result as an array of 10000 transaction objects. This is not acceptable by any means.
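The cost described above comes from where the grouping happens. A minimal sketch in plain Ruby, assuming a hypothetical Trx struct as a stand-in for the real Mongoid model, shows that group_by only runs once every object already exists in memory:

```ruby
# Hypothetical stand-in for a Mongoid document; only the fields used here.
Trx = Struct.new(:id, :trx_type)

# Pretend these objects were already loaded (and instantiated) from the database.
transactions = [
  Trx.new(1, 'credit'),
  Trx.new(2, 'debit'),
  Trx.new(3, 'credit')
]

# Enumerable#group_by runs in Ruby, after every object has been built.
grouped = transactions.group_by(&:trx_type)

# Reduce each group to just the ids, which is all we really wanted.
ids_by_type = grouped.each_with_object({}) do |(type, trxs), acc|
  acc[type] = trxs.map(&:id)
end
```

With three objects this is instant; with 10000 full documents, building the objects is the expensive part, and the grouping itself only starts after that work is done.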

Let’s try to fetch just the object id and the transaction type instead of the whole object. Let’s pluck some data.

2.0.0p247 :001> Benchmark.realtime {
transactions.limit(10000).desc(:date).only([:trx_type, :id]).

=> 3.371775417

Here we fire the same query but pluck only the data that we want. There is a little improvement in the overall time but, as we see in the database log below, no difference in the time taken by the database. However, the database is now sending much less data back: only 472k.

Thu Aug 29 09:22:34.758 [conn17407] query sodibee_development.
transactions query: { $query: { account_id:
ObjectId('51148e0a45564c8ebf001e13') },
$orderby: { date: -1 } } ntoreturn:10000 ntoskip:0
nscanned:478513 scanAndOrder:1 keyUpdates:0 numYields: 3
locks(micros) r:4169618 nreturned:10000 reslen:472674 2548ms

What if we could have the best of both worlds? Get faster results and get only relevant data! Let’s see how the aggregation pipeline fares.

> Benchmark.realtime { Transaction.collection.aggregate(
{ '$match' => { 'account_id' => "50f8f036da58d2462900079c" } },
{ '$limit' => 10000 },
{ '$sort' => { 'date' => 1 } },
{ '$group' => { '_id' => '$trx_type',
 data: { '$addToSet' => { 'id' => '$_id' } } } } ) }

=> 0.430333684

That is incredible: the entire operation finished in 0.4 seconds! Looking at the MongoDB log, we also see that only 268k of data is sent back, in 380ms!

Thu Aug 29 09:18:48.075 [conn17407] command
sodibee_development.$cmd command: { aggregate: "transactions",
pipeline: [ { $match: { _type: "Receipt", account_id:
ObjectId('51148e0a45564c8ebf001e13') } }, { $limit: 10000 },
{ $sort: { date: 1 } }, { $group: { _id: "$trx_type", data:
{ $addToSet: { id: "$_id" } } } } ] } ntoreturn:1 keyUpdates:0
numYields: 1 locks(micros) r:500062 reslen:268980 380ms
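Since the pipeline is just an array of hashes, it can be built and inspected in plain Ruby before it is ever sent to the database. The helper below is a hypothetical sketch (the name trx_type_pipeline and its parameters are not part of Mongoid); it assembles the same four stages as the query above.

```ruby
# Hypothetical helper: builds the same four stages as the query above.
# `account_id` is passed through untouched, so the caller decides which
# type the driver receives (the log above shows an ObjectId).
def trx_type_pipeline(account_id, limit = 10_000)
  [
    { '$match' => { 'account_id' => account_id } },
    { '$limit' => limit },
    { '$sort'  => { 'date' => 1 } },
    { '$group' => { '_id'  => '$trx_type',
                    'data' => { '$addToSet' => { 'id' => '$_id' } } } }
  ]
end

pipeline = trx_type_pipeline('50f8f036da58d2462900079c')

# With Mongoid loaded, the array would be handed to the collection, e.g.:
#   Transaction.collection.aggregate(pipeline)
```

Stage order matters: with $limit placed before $sort, as here, the pipeline caps the matched documents first and sorts only those. This differs from a criteria that sorts by date and then limits.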

One more interesting feature of the aggregation pipeline is the ability to manipulate the data the way we want. Let’s take a look at the $group operator.

{ '$group' => { '_id' => '$trx_type',
        data: { '$addToSet' => { 'id' => '$_id' } } } }

Here, after the data has been piped through the match, limit and sort stages, we finally group the data. We group on $trx_type, and the data for each key is a set of hashes of the form { ‘id’ => ‘$_id’ }. This data is prepared by MongoDB, and the hash is converted ‘as is’ and returned to us via Mongoid. That is why the processing on the Ruby side is next to nothing.

Now, using this hash, we can do what we want with the data:

  • fetch objects in batches
  • fetch objects by transaction type ($trx_type)
  • send these ids to Resque / Sidekiq for further processing!
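As a rough sketch of the first two ideas, here is how the grouped result might be reshaped and sliced in plain Ruby. The sample data is made up, but it follows the shape of the $group stage above: one document per transaction type, each carrying an array of { 'id' => ... } hashes.

```ruby
# Made-up sample shaped like the $group output above.
result = [
  { '_id' => 'credit', 'data' => [{ 'id' => 'a1' }, { 'id' => 'a2' }, { 'id' => 'a3' }] },
  { '_id' => 'debit',  'data' => [{ 'id' => 'b1' }] }
]

# Index the ids by transaction type for direct lookup.
ids_by_type = result.each_with_object({}) do |doc, acc|
  acc[doc['_id']] = doc['data'].map { |entry| entry['id'] }
end

# Pick one transaction type and cut its ids into batches of two.
credit_ids = ids_by_type['credit']
batches = credit_ids.each_slice(2).to_a

# Each batch could then be handed to a model lookup or a background job.
```

The batch size of two is only for the example. In practice it would be tuned to whatever the consumer of the ids can handle.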

The ‘aggregate’ method is still not directly available on a model, so, as we can see, we have to go through the collection to reach it. The code above also looks very much like writing direct MongoDB queries. If you prefer using Mongoid criteria, you could write the code in a slightly more “Mongoid” way. We could have the “$match” rewritten using a criteria like this:

{ '$match' => Transaction.
     where(account_id: "50f8f036da58d2462900079c").selector }

About Gautam Rege

Rubyist, Entrepreneur and co-founder of Josh-Software - one of the leading Ruby development shops in India.

2 Responses to Mongoid and the MongoDB Aggregation Framework

  1. Pingback: Mongoid and the MongoDB Aggregation Framework – MongoDB Studio

  2. Robert Reiz says:

    Thanks for this post. I used this instead of map and reduce. It save me a lot of time.
