Using Map Re-reduce In MongoDB To Tune The Performance

“Why do we use map reduce in MongoDB? Cos it doesn’t support joins.” – And everybody believes that now! I decided that it was time to try something different. I was amazed to see how little there is on the net about map re-reduce. And there is even lesser information about how to work with multiple collections using map reduce. Here goes nothing!

The problem we faced

One of our projects has Bookings and Vouchers, with the models that look like this

class Booking
include Mongoid::Document
# ...
field :booking_amount, type: Float
field :is_cancelled, type: Boolean

has_many :vouchers
# ...
end

class Voucher
include Mongoid::Document
# ...
field :amt_received, type: Float

belongs_to :booking
# ...
end

We needed an aggregation of all the Booking#booking_amount that were not cancelled i.e. Booking#is_cancelled != false and the aggregation of their corresponding Voucher#amt_received – basically a real-time sales report.

Solution 1: Ruby object iteration

As goes with all basic solutions, we took the Ruby approach. It was not bad 😉 The entire gist log is here and a gist of the gist is here:

Started POST "/reports/distributor" for 127.0.0.1 at 2013-03-19 15:02:16 +0530
Processing by ReportsController#distributor as HTML
...
Completed 200 OK in 4201ms (Views: 850.8ms)

This had 714 queries and took 3.5 seconds of server-side processing! This is crazy as the page would load in about 6 seconds. And the performance will degrade over time as we start iterating more Ruby Objects. NOT ACCEPTABLE.

Solution 2: Map/Reduce to the rescue

Here we had to work with these constraints:

We cannot fire any database operations in the map, reduce or finalize functions.
The reduce return result should have the same type of values returned as the emit.
The reduce result should have consistent result with all types of input values (Idempotent behaviour)

The big challenge here was that we needed an aggregation from different collections, just like a join. So, we fired 3 map/reduce operations.

Map and reduce the booking collection data

This map/reduce function worked on the bookings collection and reduced an aggregated result of all the relevant bookings. However, what we also collected were the booking ids themselves (we shall see how we use it soon). We stored the result in a collection ‘distributorReport’. Here is what the basic map/reduce functions looked like

map = function() {
if (this.is_cancelled == false) {
data = { bookings: [ this._id ],
booking_amount: this.booking_amount,
cancellations: 0
}
emit(this.resource_id, data);
} else {
emit(this.resource_id, { cancellations: 1 });
}
};

This is pretty standard except that we are also emitting the booking id itself in an Array. The reason for the array is to be consistent result between emit and reduce results!

reduce = function(key, values) {
var r = { bookings: [], booking_amount: 0, amt_received: 0, cancellations: 0 }
values.forEach(function(value) {
if (value.booking_amount) { r.booking_amount += value.booking_amount; }
if (value.amt_received) { r.amt_received += value.amt_received }
if (value.cancellations) { r.cancellations += value.cancellations }

if(value.bookings && value.bookings.length > 0) {
r.bookings = r.bookings.concat(value.bookings);
}
});
return r;
};

This is also a standard reduce function BUT its generic in nature. The conditional checks are because we may get a subset of the entire result value (see Caveats at the end of this post for more detail). As we shall soon see, the same reduce function is used again when we re-reduce the result. Here is the map/reduce invocation.

db.bookings.mapReduce(map, reduce, {
query: { resource_type: "Distributor" },
out: 'distributorReport' }
);

As we can see the distributorReport collection is ‘replaced’ by default.

Collect the booking ids from the reduced result

Since we want to aggregate only the data from the relevant vouchers, we need to collect all the booking ids from the reduced result. Here goes another map/reduce

mapStat = function() {
emit('booking_ids', { bookings: this.value.bookings });
}

reduceStat = function(key, values) {
r = { bookings: [] }
values.forEach(function(value) {
r.bookings = r.bookings.concat(value.bookings);
});
return r;
}

db.distributorReport.mapReduce(mapStat, reduceStat,
{ out: "tmpBookings" } );

This is a really easy one – very straightforward map/reduce. Issuing a second map/reduce gets us all the booking object ids in a temporary collection. Note that the collection has only 1 key, so there is no real ‘reduce’, its a direct aggregation of results.

The pseudo join – re-reduce the result

This is the interesting part. We can now fire a neat query for specific vouchers and reduce the result.

mapVoucher = function() {
emit(this.resource_id, { amt_received: this.amt_received } );
}

Here is the simple emit from the voucher. Its very important to note that the emit key has the same value as that in the booking collection.

db.vouchers.mapReduce(mapVoucher, reduce, {
query: { resource_type: "Distributor",
booking_id: { $in: db.tmpBookings.findOne().value.bookings }     },
out: { reduce: 'distributorReport'}
}
);

The above map/reduce query is the crux of the re-reduce. The code below makes it a rather large $in query but we do all the processing only on relevant data.

db.tmpBookings.findOne().value.bookings

This out option does the re-reduce magic.

out: { reduce: 'distributorReport'}

Notice that we are ‘reducing’ the result back into the same collection that we used for the first map/reduce of the booking collection. This ‘reduce’ option invokes the reduce method on the current collection. So, the reduce function merges the new emit values into the ‘distributorReport’ i.e. the earlier reduced result. (thats why the emit key MUST be the same). So, only the amt_received is updated and leaves the other values alone. We ‘reduced’ the reduced result – simply put, re-reduced it.

This is in principle analogus to a join in an RDBMS system. Though, we are using the $in query (so, its more like eager loading) the results have a drastic improvement. The log of map/reduce gist is here – 17 queries and 200ms of server side processing.

Started POST "/reports/distributor" for 127.0.0.1 at 2013-03-19 20:28:49 +0530
Processing by ReportsController#distributor as HTML
...
Completed 200 OK in 1232ms (Views: 905.5ms)

Caveats

We are firing an $in query, so its the query can be quite large if there are a LOT of documents to be processed. I am trying to figure out a way that can be solved.

The reduce function will not be called for a key that has a single value i.e. if there is only one emit for a key, its stored directly the output directly, it will not be reduced. In the example above, if there is a single cancellation and no other bookings, the following value will be saved ‘as is’ in the output collection causing breach of idempotent rule.

emit(this.resource_id, { cancellations: 1 });

So, we have to be consistent in the value of every emit. We learnt this the hard-way. The correct map function should be

map = function() {
if (this.is_cancelled == false) {
data = { bookings: [ this._id ],
booking_amount: this.booking_amount,
amt_received: 0,
cancellations: 0
}
} else {
data = { bookings: [],
booking_amount: 0,
amt_received: 0,
cancellations: 1
}
emit(this.resource_id, data);
};

This is the complete gist for the map reduce operation for use.

Let the joins begin!